Abstract

We consider a neutral dynamical model of biological diversity, where individuals live and reproduce
independently. They have i.i.d. lifetime durations (which are not necessarily exponentially
distributed) and give birth (singly) at constant rate $b$. Such a genealogical tree is usually called
a splitting tree [8], and the population counting process $(N_t; t \ge 0)$ is a homogeneous, binary
Crump-Mode-Jagers process.

We assume that individuals independently experience mutations at constant rate $\theta$ during
their lifetimes, under the infinite-alleles assumption: each mutation instantaneously confers a
brand new type, called allele, to its carrier. We are interested in the allele frequency spectrum at
time $t$, i.e., the number $A(t)$ of distinct alleles represented in the population at time $t$, and more
specifically, the numbers $A(k,t)$ of alleles represented by $k$ individuals at time $t$, $k = 1, 2, \dots, N_t$.

We mainly use two classes of tools: coalescent point processes, as defined in [13], and branching
processes counted by random characteristics, as defined in [10, 11]. We provide explicit formulae
for the expectation of $A(k,t)$ conditional on population size in a coalescent point process, which
apply to the special case of splitting trees. We separately derive the a.s. limits of $A(k,t)/N_t$ and
of $A(t)/N_t$ thanks to random characteristics, in the same vein as in [18].

Last, we separately compute the expected homozygosity by applying a method introduced in
[13], characterizing the dynamics of the tree distribution as the origination time of the tree moves
back in time, in the spirit of backward Kolmogorov equations.
1 Introduction

We consider a general branching population, where individuals reproduce independently of each
other, have i.i.d. lifetime durations, and give birth at constant rate during their lifetime. We also
assume that each birth gives rise to a single newborn. The genealogical tree associated with this
construction is known as a splitting tree [8]. The process $(N_t; t \ge 0)$ counting the population
size is a non-Markovian birth-death process belonging to the class of general branching processes,
or Crump-Mode-Jagers (CMJ) processes. Since births arrive singly and at constant rate, these
processes are sometimes called homogeneous, binary CMJ processes.

Next, individuals are given a type, called allele or haplotype. They inherit their type at birth
from their mother, and (their germ line) can change type throughout their lifetime, at the points
of independent Poisson point processes with rate $\theta$, conditional on lifetimes (neutral mutations).
The type conferred by a mutation is each time an entirely new type, an assumption known as the
infinitely-many alleles model.

We are interested in the so-called allelic partition (partition into types) of the population alive at
time $t$. A convenient way of describing this partition without labelling types is to define the number
$A_\theta(k,t)$ of types carried by $k$ individuals at time $t$. The sequence $(A_\theta(k,t); k \ge 1)$ is called the
frequency spectrum of the allelic partition. We also denote by $A_\theta(t)$ the total number of distinct
types at time $t$. The most celebrated mathematical result in this setting is Ewens' sampling formula,
which yields the distribution of the frequency spectrum for the Kingman coalescent tree with neutral
Poissonian mutations [6].

Credit is due to G. Yule [19] for the first study of a branching tree with mutations, but
interest in the infinitely-many alleles model applied to branching trees started with the work
of R.C. Griffiths and A.G. Pakes [9], where the tree under focus is a Galton-Watson tree and each
individual, with a fixed probability, is independently declared mutant. A fascinating monograph
dedicated to general branching processes (also undergoing mutations at birth times) is due to Z.
Taïb [18]. Extensive use is made there of a.s. limit theorems for branching processes counted by
random characteristics, due to P. Jagers and O. Nerman [10, 11, 12, 15].

More recently, in a series of three companion papers, J. Bertoin [UEIE] has set up a very general 
framework for Galton-Watson processes with mutations, where he has considered the allelic partition
of the whole population from origination to extinction, and studied various scaling limits for large 
initial population sizes and low mutation probabilities. Branching processes have also been used in 
the study of multistage carcinogenesis. In this setting, the emphasis is put on the waiting time until 
a target mutation occurs, see [17] and the references therein.

In this paper, we study the part of the frequency spectrum corresponding to families with a
fixed number of carriers, which we call small families. We use three techniques: coalescent point
processes, branching processes counted by random characteristics, and Kolmogorov-type equations
as a function of the origination time of the tree. In a companion paper [3], we will discuss the part
of the frequency spectrum corresponding to the largest and/or oldest families (the age of a family
being that of its original mutation).
2 Model and statement of main results

2.1 Model

In this work, we consider genealogical trees satisfying the branching property and called splitting
trees [7, 8]. Splitting trees are those random trees where individuals' lifetime durations are i.i.d. with
an arbitrary distribution, but where birth events occur at Poisson times during each individual's
lifetime. We call $b$ this constant birth rate and we denote by $V$ a r.v. distributed as the lifetime
duration. Then set $\Lambda(dr) := b\,\mathbb{P}(V \in dr)$, a finite measure on $(0, \infty]$ with total mass $b$ called the
lifespan measure. We will always assume that a splitting tree is started with one unique progenitor
born at time 0.

The process $(N_t; t \ge 0)$ counting the number of alive individuals at time $t$ is a homogeneous,
binary Crump-Mode-Jagers process, which is not Markovian unless $\Lambda$ has an exponential density or
is the Dirac mass at $\{+\infty\}$.







Figure 1: A coalescent point process for 16 individuals, hence 15 branches. 

In [13], it is shown that the genealogy of a splitting tree conditioned to be extant at a fixed time
$t$ is given by a coalescent point process, that is, a sequence of i.i.d. random variables $H_i$, $i \ge 1$,
killed at its first value greater than $t$. In particular, conditional on $N_t \ne 0$, $N_t$ follows a geometric
distribution with parameter $\mathbb{P}(H \le t)$. More specifically, for any $0 \le i < j \le N_t - 1$, the coalescence
time between the $i$-th individual alive at time $t$ and the $j$-th individual alive at time $t$ (i.e., the time
elapsed since the common lineage to both individuals split into two distinct lineages) is the maximum
of $H_{i+1}, \dots, H_j$. The graphical representation in Figure 1 is straightforward. The common law of
these so-called branch lengths is given by

$$\mathbb{P}(H > s) = \frac{1}{W(s)}, \tag{2.1}$$

where the nondecreasing function $W$ is such that $W(0) = 1$ and is characterized by its Laplace
transform. More specifically, these branch lengths are the depths of the excursions of the jump
contour process, say $Y^{(t)}$, of the splitting tree truncated below level $t$. They are i.i.d. because $Y^{(t)}$
is a Markov process. Indeed, it is shown in [14] that $Y^{(t)}$ has the law of a Lévy process, say $Y$, with
no negative jumps, reflected below $t$ and killed upon hitting 0. The function $W$ is called the scale
function of $Y$, and is defined from the Laplace exponent $\psi$ of $Y$:

$$\psi(x) = x - \int_{(0,+\infty]} \left(1 - e^{-rx}\right) \Lambda(dr), \qquad x \in \mathbb{R}_+. \tag{2.2}$$

Let $\alpha$ denote the largest root of $\psi$. In the supercritical case (i.e. $\int_{(0,+\infty]} r\,\Lambda(dr) > 1$), and in this
case only, $\alpha$ is positive and called the Malthusian parameter, because the population size grows
exponentially at rate $\alpha$ on the survival event. Then the function $W$ is characterized by

$$\int_0^\infty e^{-xr}\,W(r)\,dr = \frac{1}{\psi(x)}, \qquad x > \alpha.$$

Actually, it is possible to show by path decompositions of the process $Y$ that

$$W(x) = \exp\left(b \int_0^x dt\,\mathbb{P}(J > t)\right),$$

where $J$ is the maximum of the path of $Y$ killed upon hitting 0 and started from a random initial
value, distributed as $V$. Note that since $Y$ is also the contour process of a splitting tree, $J$ has the
law of the extinction time of the CMJ process $N$ started from one individual.
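
To make these objects concrete, here is a minimal Python sketch for one special case only: exponentially distributed lifetimes with rate d, which is an assumption of this illustration and not a restriction made in the text. The names `psi`, `scale_w`, `laplace_of_w`, `b` and `d` are ours. In that case the lifespan measure has density b d exp(-d r), so (2.2) reduces to psi(x) = x - b x/(x + d), its largest root is b - d when b > d, and inverting the Laplace transform gives W(x) = (b exp((b - d) x) - d)/(b - d).

```python
import math

def psi(x, b, d):
    """Laplace exponent (2.2) when lifetimes are exponential with rate d."""
    return x - b * x / (x + d)

def scale_w(x, b, d):
    """Scale function W for exponential(d) lifetimes; requires b != d."""
    a = b - d
    return (b * math.exp(a * x) - d) / a

def laplace_of_w(x, b, d, upper=40.0, steps=200_000):
    """Trapezoid approximation of int_0^upper exp(-x r) W(r) dr, for x > b - d."""
    h = upper / steps
    total = 0.5 * (scale_w(0.0, b, d) + math.exp(-x * upper) * scale_w(upper, b, d))
    for i in range(1, steps):
        r = i * h
        total += math.exp(-x * r) * scale_w(r, b, d)
    return total * h

b, d = 2.0, 1.0                      # supercritical example: b > d
print(psi(b - d, b, d))              # psi vanishes at the Malthusian parameter b - d
print(scale_w(0.0, b, d))            # W(0) = 1
print(laplace_of_w(3.0, b, d), 1.0 / psi(3.0, b, d))  # the two values should agree
```

The truncation point `upper` and the step count are arbitrary numerical choices; the integrand decays exponentially only when x exceeds b - d.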

In the next section, we consider coalescent point processes without reference to a splitting tree.
The law of such a process is merely characterized by a random number $N$ of i.i.d. r.v. $(H_i)$
independent of $N$, both with arbitrary distributions. In this setting, (2.1) conversely serves as a
definition of $W$, which is now an arbitrary nondecreasing function, whereas it was previously seen to
be differentiable in the special case of splitting trees. The population size $N$ can be fixed (possibly
infinite) or truly random, e.g. following a geometric distribution. It will be written $N_t$ when the law
of $H$ is supported by $[0,t]$. In this latter case, any result obtained under the assumption that $N$
follows a geometric distribution can be applied to the case of splitting trees.
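
As an illustration of this construction, the following Python sketch draws a coalescent point process of age t from a user-supplied sampler for H and reads coalescence times off the branch lengths. The function names and the choice of an exponential law for H in the usage lines are assumptions made for the example only.

```python
import random

def sample_cpp(t, draw_h, rng):
    """Draw H_1, H_2, ... i.i.d. and stop at the first value greater than t.

    Returns the kept branch lengths [H_1, ..., H_{N-1}]; the population is
    then {0, 1, ..., N-1}, with individual 0 carrying the infinite lineage.
    """
    depths = []
    while True:
        h = draw_h(rng)
        if h > t:
            return depths
        depths.append(h)

def coalescence_time(depths, i, j):
    """Coalescence time of individuals i < j: the maximum of H_{i+1}, ..., H_j."""
    if not 0 <= i < j <= len(depths):
        raise ValueError("need 0 <= i < j <= number of branches")
    return max(depths[i:j])          # depths[m] stores H_{m+1}

# Usage with a hypothetical branch-length law, H ~ Exp(1).
rng = random.Random(0)
depths = sample_cpp(3.0, lambda r: r.expovariate(1.0), rng)
print("population size:", len(depths) + 1)
```

Because the draws are stopped at the first value exceeding t, the number of kept branches is geometric, which matches the description above.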

Throughout this work, we assume that individuals independently experience mutations at Poisson 
times during their lifetime, that each new mutation event confers a brand new type (called haplotype, 
or allele) to the individual, and that a newborn holds the same type as her mother at birth time. 
The mutation rate is denoted by $\theta$.



2.2 Outline and statement of main results 

The main technique we use relies on the previously described representation of the genealogy of
a splitting tree by a sequence of i.i.d. r.v. $(H_i)_{i \ge 1}$, called the coalescent point process (see also
[16] for the critical, exponential case). The common distribution of $H_1, H_2, \dots$ is related to the
scale function $W$. We will also use the scale function $W_\theta$ associated with the lifetime of clonal
families (standard lifetime killed at its first mutation event). Section 3 is dedicated to some fine
computations in the general framework of coalescent point processes. For example, for a coalescent
point process $(H_0, H_1, \dots, H_X)$ of age $t$, where $X$ is an independent geometric r.v., Theorem 3.4
gives the expectation of $A_\theta(k,t)u^X$. Various corollaries are stated, giving the expectation, sometimes
conditional on the population size, of specific quantities of biological interest at the fixed time $t$. Those
statements extend results of [13] given under a doubly asymptotic regime ($t, n \to \infty$). For example,
Corollary 3.5 gives the expectation of the number of distinct alleles and of homozygosity (probability
of drawing two individuals carrying the same allele) and Corollary 3.11 gives the expectation of the
number $Z_0(y;n)$ among the $n$ first individuals who carry the ancestral type of lineage 0, $y$ units of
time in the past

$$\mathbb{E}\,Z_0(y;n) = e^{-\theta y} \sum_{k=0}^{n} \mathbb{P}(H \le y)^k,$$

see Remark 3.12 for a simple interpretation of this formula.

In Section 4, some of the previous results are specified to the case of splitting trees. In particular,
Proposition 4.1 yields the expectation of $A_\theta(k,t)u^{N_t}$, as well as of $Z_0(t)u^{N_t}$, where $Z_0(t)$ denotes the
number of alive individuals at time $t$ carrying the ancestral allele. The result for $A_\theta(k,t)$ can even be
detailed to the case of haplotypes of a given age. As previously, various corollaries are provided for
some quantities such as the homozygosity. Ruling out the information on the population size (i.e.,
taking $u = 1$) and on the age of the mutation, Corollary 4.3 reads

$$\mathbb{E}^\star A_\theta(k,t) = W(t) \int_0^t dx\,\theta e^{-\theta x}\,\frac{1}{W_\theta(x)^2}\left(1 - \frac{1}{W_\theta(x)}\right)^{k-1}$$

and

$$\mathbb{P}^\star(Z_0(t) = k) = W(t)\,\frac{e^{-\theta t}}{W_\theta(t)^2}\left(1 - \frac{1}{W_\theta(t)}\right)^{k-1},$$


where P* is the conditional probability on survival up until time t. Note also that Subsection 
provides the reader with a more explanatory proof of the previous formulae. 

The theory of random characteristics [10, 11, 12, 15, 18], which is the second main technique
we use, is displayed in Section 5. There, the random characteristic of individual $i$, say, can be for
example the number $\chi_i(t)$ of mutations that $i$ has experienced during her lifetime and which are
carried by $k$ alive individuals, $t$ units of time after her birth ($\chi_i(t) = 0$ if $t < 0$). Then the total
number of haplotypes carried by $k$ individuals at time $t$ (except possibly the ancestral type) is the
sum over all individuals $i$ (dead or alive) of $\chi_i(t - \sigma_i)$, where $\sigma_i$ is the birth time of individual $i$.
Now according to limit theorems by P. Jagers and O. Nerman [10, 11, 12, 15], these sums converge
a.s. on the survival event in the supercritical case. Exploiting those limit theorems, we are able to
independently derive the following a.s. convergences in the supercritical case (see Proposition 5.1).
On the survival event, 

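$$\lim_{t \to \infty} \frac{A_\theta(k,t)}{A_\theta(t)} = \frac{U_k}{U} \quad \text{a.s.}$$

and

$$\lim_{t \to \infty} \frac{A_\theta(t)}{N_t} = U \quad \text{a.s.},$$

where

$$U_k := \int_0^\infty dx\,\theta e^{-\theta x}\,\frac{1}{W_\theta(x)^2}\left(1 - \frac{1}{W_\theta(x)}\right)^{k-1}$$

and

$$U := \int_0^\infty dx\,\theta e^{-\theta x}\,\frac{1}{W_\theta(x)} = \sum_{k \ge 1} U_k.$$
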



In the final section (Section 6), we consider $G_\theta(t) := Z_0(t)(Z_0(t) - 1)/2 + \sum_{k \ge 1} k(k-1)A_\theta(k,t)/2$,
that we term absolute homozygosity, in reference to standard homozygosity, which is defined as
$2G_\theta(t)/N_t(N_t - 1)$. Homozygosity is a well-known measure of diversity, that can be seen
as the probability that two randomly sampled distinct individuals (or sequences) share the same
allele. In the spirit of backward Kolmogorov equations, we derive the dynamics of the expectation
of $G_\theta(t)u^{N_t}$ as the origination time of the tree moves back in time. Then the expected standard and
absolute homozygosity can be computed. In passing, we recover formulae obtained in Section 4 by
totally different methods. Specifically, we get $\mathbb{E}^\star G_\theta(t) = W(t)(W_{2\theta}(t) - 1)$.
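
The frequency spectrum and the two notions of homozygosity can be read off directly from a list of allele labels, one per living individual. The Python sketch below does this; the function names and the label list are illustrative assumptions. It does not single out the ancestral allele, so the pair count it returns corresponds to the sum of both terms in the definition of absolute homozygosity.

```python
from collections import Counter

def frequency_spectrum(types):
    """Map k to the number of alleles carried by exactly k individuals."""
    family_sizes = Counter(types)                # allele -> number of carriers
    return dict(Counter(family_sizes.values()))  # k -> number of such alleles

def absolute_homozygosity(types):
    """Number of unordered pairs of distinct individuals sharing an allele."""
    return sum(c * (c - 1) // 2 for c in Counter(types).values())

def standard_homozygosity(types):
    """Probability that two distinct individuals drawn at random share an allele."""
    n = len(types)
    if n < 2:
        raise ValueError("need at least two individuals")
    return 2 * absolute_homozygosity(types) / (n * (n - 1))

sample = ["a", "a", "b", "c", "c", "c"]          # hypothetical allele labels
print(frequency_spectrum(sample))
print(absolute_homozygosity(sample), standard_homozygosity(sample))
```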

3 Expected haplotype frequencies for coalescent point processes 

In this section, unless otherwise specified, we assume that the lineage of individual 0, sometimes 
called lineage 0, is infinite, and that all other branch lengths are i.i.d., distributed as some r.v. H. To 
each Hi corresponds an individual, that we call individual i. We also assume that mutations occur 
according to a Poisson point process on edge lengths with parameter $\theta$.

3.1 The next branch with no extra mutation 

We let $\mathcal{E}^\theta$ denote the set of individuals who carry no more mutations (but possibly fewer) than
individual 0 (some of, and at most exactly, the mutations carried by 0, but no other mutation). We
call such individuals $(0,\cdot)$-type individuals (same type as some point on lineage 0 at some time in
the past).

Set $K_0^\theta := 0$ and for $i \ge 1$, define $K_i^\theta$ as the label of the $i$-th individual in $\mathcal{E}^\theta$. In addition, set

$$H_i^\theta := \max\{H_j : K_{i-1}^\theta < j \le K_i^\theta\}$$

and

$$B_i^\theta := K_i^\theta - K_{i-1}^\theta.$$

See Figure 2 for a graphical representation of these quantities on a typical coalescent point process
with mutations.

We write $(B^\theta, H^\theta)$ in lieu of $(B_1^\theta, H_1^\theta)$ and we define $W_\theta(x;\gamma)$ by

$$W_\theta(x;\gamma) := \frac{1}{1 - \mathbb{E}\left(\gamma^{B^\theta}, H^\theta \le x\right)}.$$

We will also need the following notation

$$W(x;\gamma) := \frac{1}{1 - \gamma\,\mathbb{P}(H \le x)}, \qquad x \ge 0,\ \gamma \in (0,1].$$
Figure 2: On this coalescent point process, the 8-th individual is the first one whose type is the
same as some point on lineage 0 anywhere in the past, so that $8 \in \mathcal{E}^\theta$ and $B_1^\theta = 8$. The maximum
$H_1^\theta$ of the first $B_1^\theta$ branch lengths is shown. Also note that $10 \in \mathcal{E}^\theta$ and $B_2^\theta = 2$.



Theorem 3.1 The bivariate sequence $((B_i^\theta, H_i^\theta); i \ge 1)$ is a sequence of i.i.d. random pairs. In
addition, the following formula holds for all $x \ge 0$ and $\gamma \in (0,1]$

$$W_\theta(x;\gamma) = e^{-\theta x}\,W(x;\gamma) + \theta \int_0^x W(y;\gamma)\,e^{-\theta y}\,dy.$$

Remark 3.2 Differentiating both sides of the previous equation w.r.t. the first variable yields

$$dW_\theta(x;\gamma) = e^{-\theta x}\,dW(x;\gamma).$$

Remark 3.3 The formula in the previous statement was shown in [13] in the special case $\gamma = 1$.
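
The formula of Theorem 3.1 is easy to evaluate numerically once the law of H is fixed. The Python sketch below assumes, purely for illustration, that H is exponential with rate 1, so that P(H <= x) = 1 - exp(-x); the function names and the trapezoid rule are our own choices. For gamma = 1 this gives W(x;1) = exp(x), and the formula can be integrated by hand to W_theta(x;1) = (exp((1 - theta) x) - theta)/(1 - theta), which the last line compares against.

```python
import math

def w(x, gamma):
    """W(x; gamma) = 1 / (1 - gamma * P(H <= x)), with H ~ Exp(1) as an example law."""
    return 1.0 / (1.0 - gamma * (1.0 - math.exp(-x)))

def w_theta(x, gamma, theta, steps=20_000):
    """W_theta(x; gamma) via Theorem 3.1, with the integral done by the trapezoid rule."""
    if x == 0.0:
        return w(0.0, gamma)
    h = x / steps
    integral = 0.5 * (w(0.0, gamma) + w(x, gamma) * math.exp(-theta * x))
    for i in range(1, steps):
        y = i * h
        integral += w(y, gamma) * math.exp(-theta * y)
    return math.exp(-theta * x) * w(x, gamma) + theta * integral * h

theta, x = 0.5, 2.0
closed_form = (math.exp((1.0 - theta) * x) - theta) / (1.0 - theta)
print(w_theta(x, 1.0, theta), closed_form)   # the two values should agree
```

The same function can be used to check Remark 3.2 by finite differences, comparing increments of `w_theta` with increments of `w` weighted by exp(-theta x).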



Proof. First observe that the pair $(K_1^\theta, H_1^\theta)$ does not depend on the haplotype of individual 0, and
that the $i$-th $(0,\cdot)$-type individual is also the next individual after $K_{i-1}^\theta$ with no mutation other than
those carried by individual $K_{i-1}^\theta$. This ensures that $(K_i^\theta - K_{i-1}^\theta, H_i^\theta)$ has the same law as $(K_1^\theta, H_1^\theta)$,
and the independence between $(K_i^\theta - K_{i-1}^\theta, H_i^\theta)$ and previous pairs is due to the independence of
branch lengths and the fact that new mutations can only occur on branches with labels strictly
greater than $K_{i-1}^\theta$.

As for the formula relating $W_\theta$ and $W$, we consider the renewal process $S$ defined by $S_0 = 0$ and
$S_n = \sum_{i=1}^n B_i^\theta$. Next, for any integer $k \ge 0$, let $F_k$ denote the event

$$F_k := \{\exists n \ge 0 : S_n = k,\ M_n \le x\},$$

where $M_n := \max\{H_i^\theta : 1 \le i \le n\}$. Let $T_k$ denote the time elapsed since the lineages of individual
0 and individual $k$ have split up, that is, $T_k = \max\{H_i : 1 \le i \le k\}$. Notice that by definition of $H_i^\theta$,
$T_k = M_n$ on the event $\{S_n = k\}$, so that

$$F_k = \{\exists n \ge 0 : S_n = k,\ T_k \le x\}.$$

So $F_k$ is the event that the lineage of individual $k$ has had no mutation between time $-T_k$ and present
time (i.e., no mutation on the part of its lineage not common with individual 0), and $T_k \le x$. By
standard properties of Poisson processes, we get

$$\mathbb{P}(F_k) = \mathbb{E}\left(e^{-\theta T_k}, T_k \le x\right) = \mathbb{P}(H \le x)^k\,e^{-\theta x} + \theta \int_0^x \mathbb{P}(H \le y)^k\,e^{-\theta y}\,dy. \tag{3.1}$$
Jo 



Note that the r.h.s. of this equation is obtained using the integration by parts formula for càdlàg
functions (i.e. functions continuous on the right and admitting left limits at each point of the space,
like $\mathbb{P}(H \le x)$): if $f$ is continuously differentiable and $g$ is càdlàg with bounded variation,

$$f(x)g(x) = f(0)g(0) + \int_0^x f'(y)\,g(y)\,dy + \int_{(0,x]} f(y)\,dg(y). \tag{3.2}$$

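Equation (3.1) yields

$$\sum_{k \ge 0} \gamma^k\,\mathbb{P}(F_k) = e^{-\theta x}\,W(x;\gamma) + \theta \int_0^x W(y;\gamma)\,e^{-\theta y}\,dy.$$

On the other hand,

$$\begin{aligned}
\sum_{k \ge 0} \gamma^k\,\mathbb{P}(F_k) &= \sum_{k \ge 0} \gamma^k \sum_{n \ge 0} \mathbb{P}(S_n = k, M_n \le x) \\
&= \sum_{n \ge 0} \mathbb{E}\left(\gamma^{S_n}, M_n \le x\right) \\
&= \sum_{n \ge 0} \mathbb{E}\left(\gamma^{B_1^\theta + \cdots + B_n^\theta}, H_1^\theta \le x, \dots, H_n^\theta \le x\right) \\
&= \sum_{n \ge 0} \left(\mathbb{E}\left(\gamma^{B^\theta}, H^\theta \le x\right)\right)^n = W_\theta(x;\gamma),
\end{aligned}$$

which yields the desired result. □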

3.2 Expected haplotype frequencies for geometrically distributed population sizes 

Let $X$ denote some independent geometric random variable with parameter $\gamma$, that is, $\mathbb{P}(X \ge n) = \gamma^n$
for any $n \ge 0$.
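
With this convention, X takes the value n with probability (1 - gamma) gamma^n, and its generating function is E(u^X) = (1 - gamma)/(1 - u gamma), a factor that appears repeatedly in the computations of this section. The short Python sketch below checks both facts numerically; the function names and the truncation level `terms` are illustrative choices.

```python
def geometric_pgf(u, gamma):
    """Closed form of E(u^X) when P(X >= n) = gamma^n."""
    return (1.0 - gamma) / (1.0 - u * gamma)

def geometric_pgf_series(u, gamma, terms=2000):
    """Same quantity as a truncated series over the point masses (1 - gamma) * gamma^n."""
    return sum((u ** n) * (1.0 - gamma) * gamma ** n for n in range(terms))

def geometric_tail(n, gamma, terms=2000):
    """P(X >= n) recomputed from the point masses, truncated after `terms` terms."""
    return sum((1.0 - gamma) * gamma ** m for m in range(n, n + terms))

print(geometric_pgf(0.9, 0.7), geometric_pgf_series(0.9, 0.7))
print(geometric_tail(3, 0.7), 0.7 ** 3)
```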

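In the infinite-allele model, each haplotype is characterized by its most recent mutation. We
denote by $A_\theta(k,y;\gamma)$ the number of haplotypes whose most recent mutation occurred between time
$-y$ and present time and which are carried by $k$ individuals among $\{0, 1, \dots, X\}$.

Theorem 3.4 For all $k \ge 1$, $y > 0$, $\gamma \in (0,1]$, $u \in [0,1]$,

$$\mathbb{E}\left(u^X A_\theta(k,y;\gamma)\right) = \frac{1-\gamma}{(1-u\gamma)^2} \int_0^y dz\,\theta e^{-\theta z}\,\frac{1}{W_\theta(z;u\gamma)^2}\left(1 - \frac{1}{W_\theta(z;u\gamma)}\right)^{k-1}.$$
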

Let $I_\theta(y;\gamma)$ (resp. $I_\theta(y;n)$) denote the number of individuals among $\{0, 1, \dots, X\}$ (resp. $\{0, 1, \dots, n\}$)
whose most recent mutation appeared between time $-y$ and present time 0.

Let $A_\theta(y;\gamma)$ (resp. $A_\theta(y;n)$) denote the number of distinct haplotypes represented in $\{0, 1, \dots, X\}$
(resp. $\{0, 1, \dots, n\}$) whose most recent mutation appeared between time $-y$ and present time 0.

Let $G_\theta(y;n)$ denote the probability that two distinct individuals randomly drawn from $\{0, 1, \dots, n\}$
share the same haplotype and that the most recent mutation of this common haplotype appeared
between time $-y$ and present time 0.

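Corollary 3.5 For any integer $n \ge 1$,

$$\mathbb{E}\,I_\theta(y;n-1) = n\left(1 - \exp(-\theta y)\right),$$

$$\mathbb{E}\,A_\theta(y;n-1) = n \int_0^y dx\,\theta e^{-\theta x}\,\mathbb{P}(H^\theta > x) + \int_0^y dx\,\theta e^{-\theta x}\,\mathbb{E}\left(B^\theta \wedge n, H^\theta \le x\right), \tag{3.3}$$

and in the case where the law of $H$ has no atom

$$\mathbb{E}\,G_\theta(y;n-1) = \frac{2}{n(n-1)} \sum_{k=1}^{n-1} k(n-k) \int_0^y \mathbb{P}(H \in dx)\,\mathbb{P}(H \le x)^{k-1}\,e^{-\theta x}\left(e^{-\theta x} - e^{-\theta y}\right).$$
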

Remark 3.6 The first expectation can readily be deduced from some exchangeability argument, since
each individual carries a mutation with age smaller than $y$ with probability $1 - \exp(-\theta y)$ (there is no
edge effect since the ancestral lineage is infinite).

Remark 3.7 In [13], a pathwise result was shown for the number $A_\theta(\infty, n)$ of distinct haplotypes
represented in $\{0, 1, \dots, n\}$, namely

$$\lim_{n \to \infty} n^{-1} A_\theta(\infty, n) = \int_0^\infty dx\,\theta e^{-\theta x}\,\mathbb{P}(H^\theta > x) \quad \text{a.s.}$$




Remark 3.8 In the case where the law of $H$ admits atoms, the computation of $\mathbb{E}\,G_\theta(y;n-1)$ can be
done following the same line as in the proof below, using the fact that $dW(x;\gamma)$ has an atomic part.
The computation gives

E G e {y, n - 1) = 2 V ^ - k \ Ik F ^ (dx) F(H < x) fc - 1 e" te ( e~ ex - e~ 6y ) 
^n(n-l) I Jo v ' 

+ < xf- 1 - F{H < x) k -^ e- 0x [e~ ex - e~ 9y 

xe[0,y] 

where fj,"jj a ' is the non-atomic part of the law of H. 
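Proof. For the first expectation, taking $u = 1$ in the theorem,

$$\mathbb{E}\,I_\theta(y;\gamma) = \mathbb{E} \sum_{k \ge 1} k\,A_\theta(k,y;\gamma) = \frac{1 - e^{-\theta y}}{1-\gamma},$$

using repeatedly Fubini-Tonelli theorem and $\sum_{k \ge 1} k x^{k-1} = (1-x)^{-2}$ for any $x \in [0,1)$. The result
then follows from the inversion of the generating function using $(1-\gamma)^{-1} = \sum_{n \ge 0} (n+1)(1-\gamma)\gamma^n$.
For the second expectation,

$$\mathbb{E}\,A_\theta(y;\gamma) = \mathbb{E} \sum_{k \ge 1} A_\theta(k,y;\gamma) = \frac{1}{1-\gamma} \int_0^y dx\,\theta e^{-\theta x}\,\frac{1}{W_\theta(x;\gamma)} = \frac{1}{1-\gamma} \int_0^y dx\,\theta e^{-\theta x}\left(1 - \mathbb{E}\left(\gamma^{B^\theta}, H^\theta \le x\right)\right).$$

Next invert the generating function as follows

$$\begin{aligned}
\frac{1}{1-\gamma}\,\mathbb{E}\left(\gamma^{B^\theta}, H^\theta \le x\right) &= \sum_{n \ge 0} (n+1)(1-\gamma)\gamma^n \sum_{i \ge 0} \gamma^i\,\mathbb{P}(B^\theta = i, H^\theta \le x) \\
&= \sum_{n \ge 0} (1-\gamma)\gamma^n \sum_{k=0}^{n} (n+1-k)\,\mathbb{P}(B^\theta = k, H^\theta \le x) \\
&= \sum_{n \ge 0} (1-\gamma)\gamma^n\,\mathbb{E}\left(n+1-B^\theta, B^\theta \le n, H^\theta \le x\right),
\end{aligned}$$

which entails

$$\begin{aligned}
\mathbb{E}\,A_\theta(y;n) &= \int_0^y dx\,\theta e^{-\theta x}\left(n+1 - \mathbb{E}\left(n+1-B^\theta, B^\theta \le n, H^\theta \le x\right)\right) \\
&= \int_0^y dx\,\theta e^{-\theta x}\left((n+1)\,\mathbb{P}(H^\theta > x) + \mathbb{E}\left(n+1 - (n+1-B^\theta)\mathbf{1}_{\{B^\theta \le n\}}, H^\theta \le x\right)\right) \\
&= \int_0^y dx\,\theta e^{-\theta x}\left((n+1)\,\mathbb{P}(H^\theta > x) + \mathbb{E}\left((n+1)\mathbf{1}_{\{B^\theta > n\}} + B^\theta \mathbf{1}_{\{B^\theta \le n\}}, H^\theta \le x\right)\right),
\end{aligned}$$

which yields the result.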

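For the third expectation, we use the fact that the expected number of (unordered) pairs of
individuals sharing the same haplotype (younger than $y$) equals

$$\sum_{n \ge 0} (1-\gamma)\gamma^n\,\frac{n(n+1)}{2}\,\mathbb{E}\,G_\theta(y;n) = \mathbb{E}\left[G_\theta(y;\gamma)\right],$$

where

$$G_\theta(y;\gamma) := \sum_{k \ge 2} \frac{k(k-1)}{2}\,A_\theta(k,y;\gamma).$$

Now since $\sum_{k \ge 2} k(k-1)x^{k-2} = 2(1-x)^{-3}$, we get

$$\begin{aligned}
\mathbb{E}\left[G_\theta(y;\gamma)\right] &= \frac{1}{1-\gamma} \int_0^y dx\,\theta e^{-\theta x} \int_0^x e^{-\theta z}\,dW(z;\gamma) \\
&= \frac{1}{1-\gamma} \int_0^y dW(z;\gamma)\,e^{-\theta z}\left(e^{-\theta z} - e^{-\theta y}\right),
\end{aligned}$$

where differentiation of $W$ is understood w.r.t. the first variable. Then we use the fact that, when the
law of $H$ has no atom,

$$dW(z;\gamma) = \sum_{k \ge 1} k\,\gamma^k\,\mathbb{P}(H \le z)^{k-1}\,\mathbb{P}(H \in dz).$$

The proof ends writing the product series between the last power series and $(1-\gamma)^{-2} = \sum_{n \ge 0} (n+1)\gamma^n$. □

Before proving the theorem, we insert a paragraph in which we state and prove a preliminary
key result.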

3.2.1 A key lemma 

We denote by $\ell_i$ the time elapsed since the $i$-th most recent mutation on the lineage of individual 0,
also called lineage 0. Let $N_i(y;\gamma)$ denote the number of $(0,\cdot)$-type individuals in $\{0, 1, \dots, X\}$ whose
most recent mutation time in its haplotype is $\ell_i$ if $\ell_i < y$, and $N_i(y;\gamma) = 0$ otherwise.

We also define $(0,y)$-type individuals as those individuals that have the same type as the point at
time $-y$ on lineage 0. In other words, an individual is of $(0,y)$-type if the most recent mutation of its
haplotype is $\ell_i$ for the unique $i$ such that $\ell_{i-1} < y < \ell_i$. In the same vein, $(0,[0,y])$-type individuals
are those individuals that have the same type as some point on lineage 0 at any time between time
$-y$ and present time 0.

We denote by $Z_0(y;\gamma)$ the number of $(0,y)$-type individuals of $\{0, 1, \dots, X\}$. Note that $Z_0(y;\gamma) =
N_i(y;\gamma)$ where $i$ is such that $\ell_{i-1} < y < \ell_i$. Also set $I_0(y;\gamma)$ the number of $(0,[0,y])$-type individuals
of $\{0, 1, \dots, X\}$ and $I'_0(y;\gamma)$ the number of $(0,\cdot)$-type individuals of $\{0, 1, \dots, X\}$ whose most recent
mutation appeared between time $-y$ and present time 0. Otherwise said,

$$I_0(y;\gamma) = I'_0(y;\gamma) + Z_0(y;\gamma) \quad \text{and} \quad I'_0(y;\gamma) = \sum_{i \ge 1} N_i(y;\gamma).$$

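Lemma 3.9 For all $k \ge 1$, $y > 0$, $\gamma \in (0,1]$, $u \in [0,1]$,

$$\sum_{i \ge 1} \mathbb{E}\left(u^X, N_i(y;\gamma) = k\right) = \frac{1-\gamma}{1-u\gamma} \int_0^y dz\,\theta e^{-\theta z}\,\frac{W(z;u\gamma)}{W_\theta(z;u\gamma)^2}\left(1 - \frac{1}{W_\theta(z;u\gamma)}\right)^{k-1}$$

and

$$\mathbb{E}\left(u^X, Z_0(y;\gamma) = k\right) = \frac{1-\gamma}{1-u\gamma}\,e^{-\theta y}\,\frac{W(y;u\gamma)}{W_\theta(y;u\gamma)^2}\left(1 - \frac{1}{W_\theta(y;u\gamma)}\right)^{k-1}.$$
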

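Corollary 3.10 For all $y > 0$, $\gamma \in (0,1]$, $u \in [0,1]$,

$$\mathbb{E}\left(u^X I'_0(y;\gamma)\right) = \frac{1-\gamma}{1-u\gamma} \int_0^y dz\,\theta e^{-\theta z}\,W(z;u\gamma) \quad \text{and} \quad \mathbb{E}\left(u^X Z_0(y;\gamma)\right) = \frac{1-\gamma}{1-u\gamma}\,e^{-\theta y}\,W(y;u\gamma).$$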
Proof. Use the formulae in Lemma 3.9 and Fubini-Tonelli theorem repeatedly, in particular to see
that

$$\mathbb{E}\,I'_0(y;\gamma)u^X = \sum_{i \ge 1} \mathbb{E}\,u^X N_i(y;\gamma) = \sum_{i \ge 1} \sum_{k \ge 1} k\,\mathbb{E}\,u^X \mathbf{1}_{\{N_i(y;\gamma) = k\}} = \sum_{k \ge 1} k \sum_{i \ge 1} \mathbb{E}\,u^X \mathbf{1}_{\{N_i(y;\gamma) = k\}}.$$

The proof ends using $\sum_{k \ge 1} k x^{k-1} = (1-x)^{-2}$ for all $x \in [0,1)$. □

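Let $n$ be a non-negative integer. In the next corollary, $Z_0(y;n)$ denotes the number of $(0,y)$-type
individuals of $\{0, 1, \dots, n\}$ and $I'_0(y;n)$ the number of $(0,\cdot)$-type individuals of $\{0, 1, \dots, n\}$ whose
most recent mutation appeared between time $-y$ and present time 0.

Corollary 3.11 For all $y \ge 0$ and $n \ge 0$,

$$\mathbb{E}\,I'_0(y;n) = \int_0^y dz\,\theta e^{-\theta z}\,\frac{1 - \mathbb{P}(H \le z)^{n+1}}{\mathbb{P}(H > z)} \quad \text{and} \quad \mathbb{E}\,Z_0(y;n) = e^{-\theta y}\,\frac{1 - \mathbb{P}(H \le y)^{n+1}}{\mathbb{P}(H > y)}.$$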

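Proof. We use $(1-\gamma)^{-1} = \sum_{k \ge 0} \gamma^k$ along with

$$W(z;\gamma) = \sum_{n \ge 0} \gamma^n\,\mathbb{P}(H \le z)^n.$$

Plugging these equalities into the first formula of the first corollary evaluated at $u = 1$ yields

$$\mathbb{E}\,I'_0(y;\gamma) = \int_0^y dz\,\theta e^{-\theta z}\,\frac{1-\gamma}{1-\gamma}\,W(z;\gamma) = \int_0^y dz\,\theta e^{-\theta z} \sum_{n \ge 0} (1-\gamma)\gamma^n \sum_{k=0}^{n} \mathbb{P}(H \le z)^k.$$

Inverting the generating function yields the expression proposed for $\mathbb{E}\,I'_0(y;n)$. The very same line
of reasoning can be applied to get $\mathbb{E}\,Z_0(y;n)$. □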

Remark 3.12 Keeping the expression in the proof of the theorem under the shape of a sum is more
informative. Indeed, differentiating each side of the equality, we then get

$$\mathbb{E}\,I'_0(dy;n) = \theta\,dy \sum_{k=0}^{n} e^{-\theta y}\,\mathbb{P}(H \le y)^k,$$

where $I'_0(dy;n)$ denotes the number of $(0,\cdot)$-type individuals of $\{0, 1, \dots, n\}$ whose most recent
mutation is of age in $(y, y+dy)$. The interpretation of this new expression goes as follows. The term $\theta\,dy$
is the probability that a mutation occurred on lineage 0 in the time interval $(y, y+dy)$ backwards in
time; the term $\mathbb{P}(H \le y)^k$ is the probability that the lineage of individual $k$ split off lineage 0 more
recently than $y$; the term $e^{-\theta y}$ is the probability that the lineage of individual $k$ has undergone no
mutation in the last $y$ units of time.
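
The sum over k in this interpretation is a finite geometric series, which is how it collapses to a ratio involving P(H > y). The Python sketch below evaluates the expected number of individuals among {0, 1, ..., n} carrying the type that lineage 0 had y units of time ago, once as a sum and once in closed form. The law of H (exponential with rate 1) and all function names are assumptions of this example.

```python
import math

def cdf_h(y):
    """P(H <= y) for H ~ Exp(1); an illustrative choice of branch-length law."""
    return 1.0 - math.exp(-y)

def expected_z0_sum(y, n, theta):
    """exp(-theta*y) * sum_{k=0}^{n} P(H <= y)^k, one term per individual k."""
    return math.exp(-theta * y) * sum(cdf_h(y) ** k for k in range(n + 1))

def expected_z0_closed(y, n, theta):
    """Same quantity with the geometric sum evaluated in closed form."""
    p = cdf_h(y)
    return math.exp(-theta * y) * (1.0 - p ** (n + 1)) / (1.0 - p)

print(expected_z0_sum(1.5, 10, 0.3), expected_z0_closed(1.5, 10, 0.3))
```

For n = 0 both functions reduce to exp(-theta*y), the probability that individual 0 itself has had no mutation in the last y units of time.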
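Proof of Lemma 3.9. Set $D_1 := 1$ and for $i \ge 2$,

$$D_i := \min\{j \ge 1 : H_j^\theta > \ell_{i-1}\}.$$

Also recall the renewal process $S_n = \sum_{i=1}^n B_i^\theta$. Then we have for all $i \ge 1$

$$N_i(y;\gamma) = \mathbf{1}_{\{\ell_i < y\}} \left(\mathbf{1}_{\{i=1\}} + \sum_{j=D_i}^{D_{i+1}-1} \mathbf{1}_{\{S_j \le X\}}\right),$$

the indicator function of $i = 1$ being due to the count of individual 0 in that case. First, we
work conditionally on the values $v_i$ of the ages $\ell_i$ of mutations of lineage 0. Using repeatedly the
lack-of-memory property of $X$, we get for all $i \ge 2$ and $k \ge 1$

$$\begin{aligned}
\mathbb{E}\left(u^X, N_i(y;\gamma) = k \mid \ell_j = v_j, j \ge 1\right) = {} & \mathbf{1}_{\{v_i < y\}}\,\mathbb{E}\left(u^{S_{D_i-1}}, X \ge S_{D_i-1} \mid \ell_j = v_j, j \ge 1\right) \mathbb{E}\left(u^{B^\theta}, B^\theta \le X, H^\theta \le v_i \mid H^\theta > v_{i-1}\right) \\
& \times \left(\mathbb{E}\left(u^{B^\theta}, B^\theta \le X, H^\theta \le v_i\right)\right)^{k-1} \left(\mathbb{E}\left(u^X, B^\theta > X\right) + \mathbb{E}\left(u^X, B^\theta \le X, H^\theta > v_i\right)\right),
\end{aligned}$$

where the last multiplicative term equals

$$\begin{aligned}
\mathbb{E}\left(u^X, B^\theta > X\right) + \mathbb{E}\left(u^X, B^\theta \le X, H^\theta > v_i\right) &= \mathbb{E}\left(u^X\right) - \mathbb{E}\left(u^X, B^\theta \le X, H^\theta \le v_i\right) \\
&= \mathbb{E}\left(u^X\right)\left(1 - \mathbb{E}\left(u^{B^\theta}, B^\theta \le X, H^\theta \le v_i\right)\right) \\
&= \frac{1-\gamma}{1-u\gamma}\left(1 - \mathbb{E}\left((u\gamma)^{B^\theta}, H^\theta \le v_i\right)\right) \\
&= \frac{1-\gamma}{(1-u\gamma)\,W_\theta(v_i;u\gamma)}.
\end{aligned}$$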

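Similarly for $i = 1$ and $k \ge 1$,

$$\begin{aligned}
\mathbb{E}\left(u^X, N_1(y;\gamma) = k \mid \ell_j = v_j, j \ge 1\right) &= \mathbf{1}_{\{v_1 < y\}} \left(\mathbb{E}\left(u^{B^\theta}, B^\theta \le X, H^\theta \le v_1\right)\right)^{k-1} \mathbb{E}\left(u^X\right)\left(1 - \mathbb{E}\left(u^{B^\theta}, B^\theta \le X, H^\theta \le v_1\right)\right) \\
&= \mathbf{1}_{\{v_1 < y\}}\,\frac{1-\gamma}{1-u\gamma}\,\frac{1}{W_\theta(v_1;u\gamma)}\left(1 - \frac{1}{W_\theta(v_1;u\gamma)}\right)^{k-1}.
\end{aligned}$$

Now elementary probabilistic reasoning shows that for $i \ge 2$

$$\begin{aligned}
\mathbb{E}\left(u^{S_{D_i-1}}, X \ge S_{D_i-1} \mid \ell_j = v_j, j \ge 1\right) &= \sum_{k \ge 1} \mathbb{P}(H^\theta \le v_{i-1})^{k-1}\,\mathbb{P}(H^\theta > v_{i-1}) \left(\mathbb{E}\left(u^{B^\theta}, B^\theta \le X \mid H^\theta \le v_{i-1}\right)\right)^{k-1} \\
&= \frac{\mathbb{P}(H^\theta > v_{i-1})}{1 - \mathbb{E}\left((u\gamma)^{B^\theta}, H^\theta \le v_{i-1}\right)} \\
&= \mathbb{P}(H^\theta > v_{i-1})\,W_\theta(v_{i-1};u\gamma).
\end{aligned}$$

As a consequence, for all $i \ge 2$,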

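$$\mathbb{E}\left(u^X, N_i(y;\gamma) = k \mid \ell_j = v_j, j \ge 1\right) = \mathbf{1}_{\{v_i < y\}}\,\frac{1-\gamma}{1-u\gamma}\,\frac{W_\theta(v_{i-1};u\gamma)}{W_\theta(v_i;u\gamma)} \left(\mathbb{E}\left((u\gamma)^{B^\theta}, H^\theta \le v_i\right)\right)^{k-1} \mathbb{E}\left((u\gamma)^{B^\theta}, v_{i-1} < H^\theta \le v_i\right),$$

whereas

$$\mathbb{E}\left(u^X, N_1(y;\gamma) = k \mid \ell_j = v_j, j \ge 1\right) = \mathbf{1}_{\{v_1 < y\}}\,\frac{1-\gamma}{1-u\gamma}\,\frac{1}{W_\theta(v_1;u\gamma)} \left(\mathbb{E}\left((u\gamma)^{B^\theta}, H^\theta \le v_1\right)\right)^{k-1}.$$

It is well-known that for the Poisson point process of mutations,

$$\mathbb{P}(\ell_{i-1} \in dx, \ell_i \in dz) = \frac{\theta^i x^{i-2}}{(i-2)!}\,e^{-\theta z}\,dx\,dz, \qquad 0 < x < z,\ i \ge 2,$$

so that

$$\sum_{i \ge 2} \mathbb{E}\left(u^X, N_i(y;\gamma) = k\right) = \frac{1-\gamma}{1-u\gamma} \sum_{i \ge 2} \int_0^y dz \int_0^z dx\,\frac{\theta^i x^{i-2}}{(i-2)!}\,e^{-\theta z}\,\frac{1}{W_\theta(z;u\gamma)}\left(1 - \frac{1}{W_\theta(z;u\gamma)}\right)^{k-1} F_\theta(x,z;u\gamma),$$

where

$$F_\theta(x,z;u\gamma) := W_\theta(x;u\gamma)\,\mathbb{E}\left((u\gamma)^{B^\theta}, x < H^\theta \le z\right). \tag{3.4}$$

Since

$$\mathbb{E}\left(u^X, N_1(y;\gamma) = k\right) = \frac{1-\gamma}{1-u\gamma} \int_0^y dz\,\theta e^{-\theta z}\,\frac{1}{W_\theta(z;u\gamma)}\left(1 - \frac{1}{W_\theta(z;u\gamma)}\right)^{k-1},$$

we get

$$\sum_{i \ge 1} \mathbb{E}\left(u^X, N_i(y;\gamma) = k\right) = \frac{1-\gamma}{1-u\gamma} \int_0^y dz\,\theta e^{-\theta z}\,\frac{1}{W_\theta(z;u\gamma)}\left(1 - \frac{1}{W_\theta(z;u\gamma)}\right)^{k-1} \left(1 + \theta \int_0^z dx\,e^{\theta x}\,F_\theta(x,z;u\gamma)\right).$$

Now observe that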

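$$\begin{aligned}
F_\theta(x,z;u\gamma) &= W_\theta(x;u\gamma)\left(\mathbb{E}\left((u\gamma)^{B^\theta}, H^\theta \le z\right) - \mathbb{E}\left((u\gamma)^{B^\theta}, H^\theta \le x\right)\right) \\
&= W_\theta(x;u\gamma)\left(\frac{1}{W_\theta(x;u\gamma)} - \frac{1}{W_\theta(z;u\gamma)}\right) \\
&= 1 - \frac{W_\theta(x;u\gamma)}{W_\theta(z;u\gamma)},
\end{aligned}$$

so that the integration by parts formula (3.2) yields

$$1 + \theta \int_0^z dx\,e^{\theta x}\,F_\theta(x,z;u\gamma) = 1 + \left[e^{\theta x}\left(1 - \frac{W_\theta(x;u\gamma)}{W_\theta(z;u\gamma)}\right)\right]_{x=0}^{x=z} + \int_0^z \frac{e^{\theta x}}{W_\theta(z;u\gamma)}\,dW_\theta(x;u\gamma),$$

where differentiation of $W_\theta$ is understood w.r.t. the first variable. Since by Theorem 3.1 $dW_\theta(x;u\gamma) =
e^{-\theta x}\,dW(x;u\gamma)$, we get

$$1 + \theta \int_0^z dx\,e^{\theta x}\,F_\theta(x,z;u\gamma) = \frac{W(z;u\gamma)}{W_\theta(z;u\gamma)}, \tag{3.5}$$

which ends the proof for the first formula. Let us turn to $Z_0(y;\gamma)$. The same kind of reasoning as
previously shows that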

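$$\begin{aligned}
\mathbb{E}\left(u^X, Z_0(y;\gamma) = k \mid \ell_j = v_j, j \ge 1\right) = {} & \sum_{i \ge 1} \mathbf{1}_{\{v_{i-1} < y < v_i\}}\,\mathbb{E}\left(u^{S_{D_i-1}}, X \ge S_{D_i-1} \mid \ell_j = v_j, j \ge 1\right) \left(\mathbb{E}\left(u^{B^\theta}, B^\theta \le X, H^\theta \le y \mid H^\theta > v_{i-1}\right)\mathbf{1}_{\{i \ge 2\}} + \mathbf{1}_{\{i=1\}}\right) \\
& \times \left(\mathbb{E}\left(u^{B^\theta}, B^\theta \le X, H^\theta \le y\right)\right)^{k-1} \left(\mathbb{E}\left(u^X, B^\theta > X\right) + \mathbb{E}\left(u^X, B^\theta \le X, H^\theta > y\right)\right),
\end{aligned}$$

with the convention $v_0 = 0$. Referring to the calculations above, we easily get

$$\begin{aligned}
\mathbb{E}\left(u^X, Z_0(y;\gamma) = k \mid \ell_j = v_j, j \ge 1\right) = {} & \frac{1-\gamma}{1-u\gamma} \sum_{i \ge 1} \mathbf{1}_{\{v_{i-1} < y < v_i\}}\,\frac{1}{W_\theta(y;u\gamma)}\left(1 - \frac{1}{W_\theta(y;u\gamma)}\right)^{k-1} \\
& \times \left(\mathbf{1}_{\{i=1\}} + \mathbf{1}_{\{i \ge 2\}}\,W_\theta(v_{i-1};u\gamma)\,\mathbb{E}\left((u\gamma)^{B^\theta}, v_{i-1} < H^\theta \le y\right)\right).
\end{aligned}$$

Integrating over the law of the Poisson point process of mutations yields

$$\begin{aligned}
\mathbb{E}\left(u^X, Z_0(y;\gamma) = k\right) = {} & \frac{1-\gamma}{1-u\gamma}\,e^{-\theta y}\,\frac{1}{W_\theta(y;u\gamma)}\left(1 - \frac{1}{W_\theta(y;u\gamma)}\right)^{k-1} \\
& + \frac{1-\gamma}{1-u\gamma} \sum_{i \ge 2} \int_y^\infty dz \int_0^y dx\,\frac{\theta^i x^{i-2}}{(i-2)!}\,e^{-\theta z}\,\frac{1}{W_\theta(y;u\gamma)}\left(1 - \frac{1}{W_\theta(y;u\gamma)}\right)^{k-1} F_\theta(x,y;u\gamma),
\end{aligned}$$

where $F_\theta$ was defined in (3.4). Thanks to equation (3.5), we get

$$\begin{aligned}
\mathbb{E}\left(u^X, Z_0(y;\gamma) = k\right) &= \frac{1-\gamma}{1-u\gamma}\,e^{-\theta y}\,\frac{1}{W_\theta(y;u\gamma)}\left(1 - \frac{1}{W_\theta(y;u\gamma)}\right)^{k-1} \left(1 + \theta \int_0^y dx\,e^{\theta x}\,F_\theta(x,y;u\gamma)\right) \\
&= \frac{1-\gamma}{1-u\gamma}\,e^{-\theta y}\,\frac{W(y;u\gamma)}{W_\theta(y;u\gamma)^2}\left(1 - \frac{1}{W_\theta(y;u\gamma)}\right)^{k-1},
\end{aligned}$$

which is the desired formula. □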



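3.2.2 Proof of Theorem 3.4

Let $M_n(k,y;\gamma)$ denote the number of haplotypes whose most recent mutation occurred between time
$-y$ and present time on the $n$-th branch (with i.i.d. lengths $H_n$, except $H_0 = +\infty$), and which are
carried by $k$ individuals among $\{0, 1, \dots, X\}$ (hence among $\{n, n+1, \dots, X\}$). In particular,

$$A_\theta(k,y;\gamma) = \sum_{n \ge 0} M_n(k,y;\gamma).$$

First,

$$M_0(k,y;\gamma) = \sum_{i \ge 1} \mathbf{1}_{\{N_i(y;\gamma) = k\}},$$

so thanks to Lemma 3.9

$$\mathbb{E}\left(u^X M_0(k,y;\gamma)\right) = \int_0^y dz\,F(k,z;u\gamma),$$

where we have used the following definition

$$F(k,z;u\gamma) := \frac{1-\gamma}{1-u\gamma}\,\theta e^{-\theta z}\,\frac{W(z;u\gamma)}{W_\theta(z;u\gamma)^2}\left(1 - \frac{1}{W_\theta(z;u\gamma)}\right)^{k-1}.$$

Second, for all $n \ge 1$, by the lack-of-memory property of the geometric variable $X$,

$$\begin{aligned}
\mathbb{E}\left(u^X M_n(k,y;\gamma)\right) &= u^n\,\mathbb{P}(X \ge n)\left(\int_0^y \mathbb{P}(H_n \in dx)\,\mathbb{E}\left(u^X M_0(k,x;\gamma)\right) + \mathbb{P}(H_n > y)\,\mathbb{E}\left(u^X M_0(k,y;\gamma)\right)\right) \\
&= (u\gamma)^n \left(\int_0^y \mathbb{P}(H \in dx) \int_0^x dz\,F(k,z;u\gamma) + \mathbb{P}(H > y) \int_0^y dz\,F(k,z;u\gamma)\right) \\
&= (u\gamma)^n \int_0^y dz\,F(k,z;u\gamma)\,\mathbb{P}(H > z).
\end{aligned}$$

Now since $A_\theta(k,y;\gamma) = \sum_{n \ge 0} M_n(k,y;\gamma)$, we get

$$\begin{aligned}
\mathbb{E}\left(u^X A_\theta(k,y;\gamma)\right) &= \int_0^y dz\,F(k,z;u\gamma) + \sum_{n \ge 1} (u\gamma)^n \int_0^y dz\,F(k,z;u\gamma)\,\mathbb{P}(H > z) \\
&= \int_0^y dz\,F(k,z;u\gamma)\,\frac{1 - u\gamma\,\mathbb{P}(H \le z)}{1-u\gamma} \\
&= \int_0^y dz\,F(k,z;u\gamma)\left[(1-u\gamma)\,W(z;u\gamma)\right]^{-1},
\end{aligned}$$

hence the result, recalling the definition of $F$. □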



4 Splitting trees: Expected haplotype frequencies at fixed time 

4.1 Joint expected haplotype frequencies with population size distribution 

In this subsection, we apply the results of the previous section to a splitting tree started at time $-t$
from one single individual and conditioned to be extant at present time 0. Then the population at
present time is $\{0, 1, \dots, N_t - 1\}$, where $N_t$ is the population size and $N_t - 1$ follows the geometric
distribution with parameter

$$\gamma_t := \mathbb{P}(H \le t), \qquad t > 0,$$

that is, $\mathbb{P}^\star(N_t - 1 \ge n) = \gamma_t^n$ for any integer $n \ge 0$, where $\mathbb{P}^\star$ denotes the probability conditional on
the population being extant at time 0. We recall that, in the case of splitting trees, the law of the
branch lengths $H$ is always absolutely continuous w.r.t. Lebesgue's measure.

The difference with the previous section is that the lengths of branches are (still i.i.d. but) distributed
as $H$ conditional on $H \le t$. As a consequence, everything we have done in the previous section holds
for the standing population of a splitting tree founded $t$ units of time ago and conditioned upon
survival up to $t$, replacing $\gamma$ with $\gamma_t$ and $W$ with (from Theorem 3.1)

$$W^{(t)}(x;a) := \frac{1}{1 - a\,\mathbb{P}(H \le x \mid H \le t)}, \qquad x \in [0,t],\ a \in (0,1].$$

In particular we now use $W_\theta^{(t)}$ instead of $W_\theta$, with

$$W_\theta^{(t)}(x;a) = e^{-\theta x}\,W^{(t)}(x;a) + \theta \int_0^x dy\,W^{(t)}(y;a)\,e^{-\theta y}.$$


We call a derived haplotype a haplotype which is different from the ancestral haplotype. Noticing that $W^{(t)}(x; u\gamma_t) = W(x; u)$, we also have $W_\theta^{(t)}(x; u\gamma_t) = W_\theta(x; u)$, where we stick to the notation from the previous section, namely,

$$W(x; u) = \frac{1}{1 - u\, \mathbb{P}(H < x)}, \qquad x \ge 0,\ u \in (0, 1],$$

and (from Theorem 3.1 again)

$$W_\theta(x; u) = e^{-\theta x}\, W(x; u) + \theta \int_0^x dy\, W(y; u)\, e^{-\theta y}.$$

Then the following statement stems readily from Theorem 3.4 and Lemma 3.9. Recall that $W(x) = W(x; 1)$ and that $W_\theta(x) = W_\theta(x; 1)$.

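Proposition 4.1 Let $A_\theta(k, t)$ denote the number of derived haplotypes represented by $k$ individuals in the standing population of a splitting tree founded $t$ units of time ago and $Z_0(t)$ the number of individuals in the standing population carrying the ancestral haplotype. Then for all $t > 0$ and $u \in (0, 1]$,

$$\mathbb{E}^\star\big(u^{N_t - 1} A_\theta(k, t)\big) = \frac{W(t; u)^2}{W(t)} \int_0^t dx\, \frac{\theta\, e^{-\theta x}}{W_\theta(x; u)^2} \left(1 - \frac{1}{W_\theta(x; u)}\right)^{k-1},$$

and

$$\mathbb{E}^\star\big(u^{N_t - 1}\, \mathbf{1}_{\{Z_0(t) = k\}}\big) = \frac{W(t; u)^2}{W(t)}\, \frac{e^{-\theta t}}{W_\theta(t; u)^2} \left(1 - \frac{1}{W_\theta(t; u)}\right)^{k-1}.$$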



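Remark 4.2 Not to overload the notation, we have not considered the alleles of age less than $y$. If $A_\theta(k, y, t)$ denotes the number of derived haplotypes of age less than $y$, represented by $k$ individuals in the standing population of a splitting tree founded $t$ units of time ago, then we get the same formula as in the previous statement, but where the upper bound of the integral has changed:

$$\mathbb{E}^\star\big(u^{N_t - 1} A_\theta(k, y, t)\big) = \frac{W(t; u)^2}{W(t)} \int_0^y dx\, \frac{\theta\, e^{-\theta x}}{W_\theta(x; u)^2} \left(1 - \frac{1}{W_\theta(x; u)}\right)^{k-1}.$$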



The following corollary is obtained by taking u = 1 in the last statement. A more explanatory proof 
is given in the next subsection. 

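Corollary 4.3 We have

$$\mathbb{E}^\star A_\theta(k, t) = W(t) \int_0^t dx\, \frac{\theta\, e^{-\theta x}}{W_\theta(x)^2} \left(1 - \frac{1}{W_\theta(x)}\right)^{k-1}$$

and

$$\mathbb{P}^\star(Z_0(t) = k) = W(t)\, \frac{e^{-\theta t}}{W_\theta(t)^2} \left(1 - \frac{1}{W_\theta(t)}\right)^{k-1}.$$

The same kinds of calculations as those done for the corollaries of the previous section yield the following statement, where the first equation could readily be deduced by exchangeability arguments.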



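Corollary 4.4 Recall that $Z_0(t)$ is the number of individuals in the standing population carrying the ancestral type and let $A_\theta(t)$ be the number of derived haplotypes represented in the standing population. Then for any positive real number $t$ and positive integer $n$,

$$\mathbb{E}^\star(Z_0(t) \mid N_t = n) = n \exp(-\theta t)$$

and

$$\mathbb{E}^\star(A_\theta(t) \mid N_t = n) = n \int_0^t dx\, \theta\, e^{-\theta x}\, \mathbb{E}\Big(1 - \mathbb{P}(H < t)^{-B_\theta}\, \mathbf{1}_{\{H_\theta < x\}}\Big) + \int_0^t dx\, \theta\, e^{-\theta x}\, \mathbb{E}\Big((B_\theta \wedge n)\, \mathbb{P}(H < t)^{-B_\theta},\ H_\theta < x\Big).$$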

Proof. The first result is clear letting y go to +00 in Corollary [37TTJ In view of (|3.3[) in Corollary [331 
in order to prove the second result, we only need to check that 

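$$\tilde{\mathbb{P}}(H_\theta > x) = \mathbb{E}\Big(1 - \mathbb{P}(H < t)^{-B_\theta}\, \mathbf{1}_{\{H_\theta < x\}}\Big)$$

and

$$\tilde{\mathbb{E}}(B_\theta \wedge n,\ H_\theta < x) = \mathbb{E}\Big((B_\theta \wedge n)\, \mathbb{P}(H < t)^{-B_\theta},\ H_\theta < x\Big),$$

where $\tilde{\mathbb{P}}$ is the law of the coalescent point process when the r.v. $(H_i)$ are i.i.d. with common law $\mathbb{P}(H \in \cdot \mid H < t)$. Now,

$$\begin{aligned}
\tilde{\mathbb{P}}(H_\theta < x) &= \mathbb{P}(H_\theta < x \mid \forall i \le B_\theta,\ H_i < t) \\
&= \sum_{k \ge 1} \mathbb{P}(B_\theta = k,\ H_\theta < x)\, \mathbb{P}(H < t)^{-k} \\
&= \mathbb{E}\Big(\mathbb{P}(H < t)^{-B_\theta}\, \mathbf{1}_{\{H_\theta < x\}}\Big),
\end{aligned}$$

and taking complements gives the first equality. The second equality, very similar, is left to the reader. □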
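As a numerical illustration of the $u = 1$ case of Proposition 4.1 and of the first identity in Corollary 4.4, the sketch below evaluates $\mathbb{E}^\star A_\theta(k,t) = W(t)\int_0^t \theta e^{-\theta x} W_\theta(x)^{-2}(1 - 1/W_\theta(x))^{k-1}dx$ and $\mathbb{P}^\star(Z_0(t)=k)$, then checks that counting individuals type by type recovers $\mathbb{E}^\star N_t = W(t)$. The choice $W(x) = e^{bx}$ (a Yule-type tree with infinite lifetimes) and all function names are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def W(x, b=1.0):
    """Scale function; W(x) = exp(b*x) is an illustrative (Yule-type) assumption."""
    return math.exp(b * x)

def W_theta(x, theta, b=1.0):
    """Closed form of e^{-theta x} W(x) + theta * int_0^x W(y) e^{-theta y} dy for W = exp(b*.)."""
    if abs(b - theta) < 1e-12:
        return 1.0 + theta * x
    return (b * math.exp((b - theta) * x) - theta) / (b - theta)

def midpoint(f, lo, hi, n=1000):
    """Composite midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def expected_A(k, t, theta, b=1.0):
    """E* A_theta(k, t): expected number of derived haplotypes carried by k individuals."""
    def integrand(x):
        w = W_theta(x, theta, b)
        return theta * math.exp(-theta * x) / w**2 * (1.0 - 1.0 / w) ** (k - 1)
    return W(t, b) * midpoint(integrand, 0.0, t)

def prob_Z0(k, t, theta, b=1.0):
    """P*(Z_0(t) = k): the ancestral haplotype is carried by k individuals."""
    w = W_theta(t, theta, b)
    return W(t, b) * math.exp(-theta * t) / w**2 * (1.0 - 1.0 / w) ** (k - 1)

def expected_population(t, theta, b=1.0, kmax=200):
    """Sum of k * (E* A_theta(k,t) + P*(Z_0(t)=k)), truncated at kmax; should be close to W(t)."""
    return sum(k * (expected_A(k, t, theta, b) + prob_Z0(k, t, theta, b))
               for k in range(1, kmax + 1))
```

The ancestral part of the sum equals $W(t)e^{-\theta t}$, which matches $\mathbb{E}^\star(Z_0(t) \mid N_t = n) = n e^{-\theta t}$ after averaging over $N_t$.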

Recall that $\bar G_\theta(t)$ denotes the (absolute) homozygosity in the standing population, that is,

$$\bar G_\theta(t) = \frac{Z_0(t)\,(Z_0(t) - 1)}{2} + \sum_{k \ge 2} \frac{k(k-1)}{2}\, A_\theta(k, t).$$

Then we easily get

Proposition 4.5 For all $t > 0$ and $u \in (0, 1]$,

$$\mathbb{E}^\star\big(u^{N_t - 1}\, \bar G_\theta(t)\big) = \frac{W(t; u)^2}{W(t)}\, \big(W_{2\theta}(t; u) - 1\big).$$

Note that explicit formulas can also be obtained for the expectation of the standard homozygosity $G_\theta(t) = 2 \bar G_\theta(t) / (N_t (N_t - 1))$, which is the probability that two randomly sampled individuals in the population at time $t$ have the same haplotype. Formulas are given in Section 6, where they are obtained thanks to an alternative proof based on moment generating function computations.
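Proof. We use Proposition 4.1 and the fact that $\sum_{k \ge 2} k(k-1)\, x^{k-2} = 2/(1-x)^3$. An integration by parts yields

$$\begin{aligned}
\mathbb{E}^\star\big(u^{N_t - 1}\, \bar G_\theta(t)\big) &= \frac{W(t; u)^2}{W(t)} \left[ e^{-\theta t}\big(W_\theta(t; u) - 1\big) + \int_0^t dx\, \theta\, e^{-\theta x}\big(W_\theta(x; u) - 1\big) \right] \\
&= \frac{W(t; u)^2}{W(t)} \left[ e^{-\theta t}\big(W_\theta(t; u) - 1\big) - \Big[e^{-\theta x}\big(W_\theta(x; u) - 1\big)\Big]_0^t + \int_0^t dx\, e^{-\theta x}\, W_\theta'(x; u) \right] \\
&= \frac{W(t; u)^2}{W(t)} \int_0^t dx\, e^{-\theta x}\, W_\theta'(x; u),
\end{aligned}$$

where differentiation is understood w.r.t. the first variable. Recalling that $W_\theta'(x; u) = e^{-\theta x}\, W'(x; u)$ provides the announced formula. □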
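The integration by parts in this proof can be checked numerically. The sketch below takes $u = 1$ and compares the pre-integration expression $W(t)\big[e^{-\theta t}(W_\theta(t) - 1) + \int_0^t \theta e^{-\theta x}(W_\theta(x) - 1)\,dx\big]$ with the closed form $W(t)(W_{2\theta}(t) - 1)$. The scale function $W(x) = e^{bx}$ and the helper names are illustrative assumptions, not the paper's.

```python
import math

def W(x, b=1.0):
    """Scale function; W(x) = exp(b*x) is an illustrative assumption."""
    return math.exp(b * x)

def W_theta(x, theta, b=1.0):
    """e^{-theta x} W(x) + theta * int_0^x W(y) e^{-theta y} dy, in closed form for W = exp(b*.)."""
    if abs(b - theta) < 1e-12:
        return 1.0 + theta * x
    return (b * math.exp((b - theta) * x) - theta) / (b - theta)

def midpoint(f, lo, hi, n=1000):
    """Composite midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def homozygosity_sum_form(t, theta, b=1.0):
    """Expression obtained before the integration by parts (u = 1)."""
    boundary = math.exp(-theta * t) * (W_theta(t, theta, b) - 1.0)
    integral = midpoint(
        lambda x: theta * math.exp(-theta * x) * (W_theta(x, theta, b) - 1.0), 0.0, t)
    return W(t, b) * (boundary + integral)

def homozygosity_closed_form(t, theta, b=1.0):
    """W(t) * (W_{2 theta}(t) - 1), the u = 1 case of Proposition 4.5."""
    return W(t, b) * (W_theta(t, 2.0 * theta, b) - 1.0)
```

Both functions return the expected number of pairs of individuals sharing a haplotype, so they should agree up to quadrature error for any $t > 0$ and $\theta > 0$.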



4.2 An explanatory proof of Corollary 4.3

Consider the standing population at time $t$ conditioned on being nonempty (probability measure $\mathbb{P}^\star$). For any real number $y \in (0, t)$ and any non-negative integer $i$, let $C_i(y; dy)$, $D_i(y)$ and $E_i(k, y)$ denote the following events:

$C_i(y; dy) := \{i \le N_t - 1$, the $i$-th branch length has size $H_i > y$ and carries a mutation with age in $(y, y + dy)\}$

$D_i(y) := \{$the type carried by the lineage of the $i$-th individual at time $t - y$ has at least one alive representative$\}$

$E_i(k, y) := \{$the type carried by the lineage of the $i$-th individual at time $t - y$ has $k$ alive representatives$\}$

Then define $A_\theta(k, t, y; dy)$ as the number of haplotypes of age in the interval $(y, y + dy)$ represented by exactly $k$ alive individuals at time $t$. Hereafter, we compute the expectation under $\mathbb{P}^\star$ of $A_\theta(k, t, y; dy)$. The result will follow from the equality

$$A_\theta(k, t) = \int_0^t A_\theta(k, t, y; dy).$$

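Now it is readily seen that

$$A_\theta(k, t, y; dy) = \sum_{i \ge 0} \mathbf{1}_{C_i(y; dy) \cap E_i(k, y)},$$

so that

$$\mathbb{E}^\star A_\theta(k, t, y; dy) = \sum_{i \ge 0} \mathbb{P}^\star\big(C_i(y; dy) \cap E_i(k, y)\big).$$

Next observe that $E_i(k, y) \subset D_i(y)$, so that

$$\begin{aligned}
\mathbb{P}^\star\big(C_i(y; dy) \cap E_i(k, y)\big) &= \mathbb{P}^\star\big(C_i(y; dy)\big)\, \mathbb{P}^\star\big(D_i(y) \mid C_i(y; dy)\big)\, \mathbb{P}^\star\big(E_i(k, y) \mid D_i(y) \cap C_i(y; dy)\big) \\
&= \mathbb{P}^\star\big(C_i(y; dy)\big)\, \mathbb{P}^\star\big(D_0(y)\big)\, \mathbb{P}^\star\big(E_0(k, y) \mid D_0(y)\big).
\end{aligned}$$

Thus, we record that

$$\mathbb{E}^\star A_\theta(k, t, y; dy) = \mathbb{P}^\star\big(D_0(y)\big)\, \mathbb{P}^\star\big(E_0(k, y) \mid D_0(y)\big) \sum_{i \ge 0} \mathbb{P}^\star\big(C_i(y; dy)\big). \tag{4.1}$$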



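We will now prove the three following equalities:

$$\sum_{i \ge 0} \mathbb{P}^\star\big(C_i(y; dy)\big) = \theta\, dy\, \frac{W(t)}{W(y)}, \tag{4.2}$$

$$\mathbb{P}^\star\big(D_0(y)\big) = \frac{W(y)\, e^{-\theta y}}{W_\theta(y)}, \tag{4.3}$$

$$\mathbb{P}^\star\big(E_0(k, y) \mid D_0(y)\big) = \frac{1}{W_\theta(y)} \left(1 - \frac{1}{W_\theta(y)}\right)^{k-1}. \tag{4.4}$$

These three equalities, along with (4.1), yield the expected expression

$$\mathbb{E}^\star A_\theta(k, t, y; dy) = W(t)\, \theta\, dy\, \frac{e^{-\theta y}}{W_\theta(y)^2} \left(1 - \frac{1}{W_\theta(y)}\right)^{k-1}, \tag{4.5}$$

which now sheds light on the meaning of each of the terms in the formula given in Corollary 4.3. Let us now prove equations (4.2), (4.3) and (4.4). First,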

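$$\begin{aligned}
\mathbb{P}^\star\big(C_i(y; dy)\big) &= \mathbb{P}^\star(N_t - 1 \ge i)\, \theta\, dy\, \Big(\mathbf{1}_{i = 0} + \mathbf{1}_{i \ge 1}\, \mathbb{P}(H > y \mid H < t)\Big) \\
&= \left(1 - \frac{1}{W(t)}\right)^{i} \theta\, dy \left(\mathbf{1}_{i = 0} + \mathbf{1}_{i \ge 1}\, \frac{\dfrac{1}{W(y)} - \dfrac{1}{W(t)}}{1 - \dfrac{1}{W(t)}}\right) \\
&= \theta\, dy \left(\mathbf{1}_{i = 0} + \mathbf{1}_{i \ge 1} \left(1 - \frac{1}{W(t)}\right)^{i-1} \left(\frac{1}{W(y)} - \frac{1}{W(t)}\right)\right),
\end{aligned}$$

so we get (4.2).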

Second, let $L$ denote an independent exponential r.v. with parameter $\theta$, so that $(y - L)^+$ is the age of the oldest mutation on lineage 0 with age smaller than $y$, with the convention that this age is zero when there is no such mutation. Then either $L > y$, and $D_0(y)$ is realized because lineage 0 has carried the same type since time $t - y$, or $L < y$ and $D_0(y)$ is realized iff the next branch with no extra mutation than 0 for which the maximum of past branch lengths exceeds $t - L$ satisfies that this maximum does not exceed $y$ (see Subsection 3.1). Conditional on $L = x$, this last event occurs with probability $\mathbb{P}(H_\theta < y \mid H_\theta > y - x)$. As a consequence, we get



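$$\begin{aligned}
\mathbb{P}^\star\big(D_0(y)\big) &= e^{-\theta y} + \int_0^y dx\, \theta\, e^{-\theta x} \left(1 - \frac{W_\theta(y - x)}{W_\theta(y)}\right) \\
&= 1 - \frac{1}{W_\theta(y)} \int_0^y dx\, \theta\, e^{-\theta x}\, W_\theta(y - x) \\
&= 1 - \frac{e^{-\theta y}}{W_\theta(y)} \int_0^y du\, \theta\, e^{\theta u}\, W_\theta(u),
\end{aligned}$$

and an integration by parts using the relationship between $W$ and $W_\theta$ (see Remark 3.2) yields (4.3). Finally, (4.4) stems from the definition of $W_\theta$ (see again Subsection 3.1).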
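The two computations above lend themselves to a quick numerical sanity check. The sketch below sums the probabilities $\mathbb{P}^\star(C_i(y;dy))/(\theta\,dy)$ over $i$ and compares the result with $W(t)/W(y)$, then evaluates $1 - e^{-\theta y} W_\theta(y)^{-1}\int_0^y \theta e^{\theta u} W_\theta(u)\,du$ and compares it with $W(y)e^{-\theta y}/W_\theta(y)$. As before, $W(x) = e^{bx}$ and the function names are assumptions chosen for illustration.

```python
import math

def W(x, b=1.0):
    """Scale function; W(x) = exp(b*x) is an illustrative assumption."""
    return math.exp(b * x)

def W_theta(x, theta, b=1.0):
    """e^{-theta x} W(x) + theta * int_0^x W(y) e^{-theta y} dy, in closed form for W = exp(b*.)."""
    if abs(b - theta) < 1e-12:
        return 1.0 + theta * x
    return (b * math.exp((b - theta) * x) - theta) / (b - theta)

def midpoint(f, lo, hi, n=2000):
    """Composite midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def sum_prob_C(t, y, b=1.0, imax=5000):
    """Sum over i of P*(C_i(y;dy)) / (theta dy), truncated at imax."""
    q = 1.0 - 1.0 / W(t, b)                      # P*(N_t - 1 >= i) = q**i
    cond = (1.0 / W(y, b) - 1.0 / W(t, b)) / q   # P(H > y | H < t)
    return 1.0 + sum(q**i * cond for i in range(1, imax + 1))

def prob_D0_integral_form(y, theta, b=1.0):
    """P*(D_0(y)) before the integration by parts."""
    integral = midpoint(
        lambda u: theta * math.exp(theta * u) * W_theta(u, theta, b), 0.0, y)
    return 1.0 - math.exp(-theta * y) / W_theta(y, theta, b) * integral

def prob_D0_closed_form(y, theta, b=1.0):
    """Right-hand side of (4.3)."""
    return W(y, b) * math.exp(-theta * y) / W_theta(y, theta, b)
```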
5 Splitting trees: A.s. convergence of haplotype frequencies

In this section, we rely on the theory of random characteristics introduced in the seminal papers [10, 15] and further developed in [12] and especially in [18], where the emphasis, as here, is on branching populations experiencing mutations (but there the mutation scheme is different, since mutation events occur simultaneously with births).

We will assume that the splitting tree starts at time 0 with one individual. Then recall from the last subsection that $N_t$ denotes the number of individuals alive at time $t$, $A_\theta(t)$ denotes the number of derived haplotypes carried by alive individuals at time $t$, $A_\theta(k, t)$ denotes the number of derived haplotypes carried by $k$ alive individuals at time $t$, and $Z_0(t)$ denotes the number of alive individuals at time $t$ carrying the ancestral haplotype.

For any individual $i$ in the population, we let $\chi_i(t)$ (resp. $\chi_i^k(t)$) be the number of mutations that $i$ has experienced during her lifetime that are carried by alive individuals (resp. by $k$ alive individuals) $t$ units of time after her birth ($\chi_i(t) = 0$ if $t < 0$). Then $\chi$ and the $\chi^k$ are individual random characteristics, in the sense given in the previously cited papers. In particular,

$$A_\theta(t) + \mathbf{1}_{\{Z_0(t) \ge 1\}} = \sum_i \chi_i(t - \sigma_i)$$

and

$$A_\theta(k, t) + \mathbf{1}_{\{Z_0(t) = k\}} = \sum_i \chi_i^k(t - \sigma_i),$$

where $\sigma_i$ denotes the birth time of $i$ and the sum is taken over all individuals, dead or alive at time $t$, in the population. This allows us to make use of limit theorems for individuals counted by random characteristics proved in [10, 11, 15], using the formulation of [18, Appendix A].

Recall that $b$ is the birth rate of our homogeneous Crump-Mode-Jagers process, that $V$ denotes a random lifetime duration, and that $\alpha$ denotes the Malthusian parameter, which satisfies $\psi(\alpha) = 0$, where $\psi$ is defined in (2.2).

Let us restate the results in [18, Appendix A] in our setting. Set

$$\beta := \int_{(0, \infty]} u\, e^{-\alpha u}\, \mu(du),$$

where the last integral is a Stieltjes integral w.r.t. the nondecreasing function

$$\mu(t) = \mathbb{E}(\#\ \text{offspring born on } (0, t]) = b\, \mathbb{E}(t \wedge V) = \int_{(0, +\infty]} (r \wedge t)\, \Lambda(dr).$$

Also, for any individual random characteristic, say $\chi$, define $\hat\chi(\alpha)$ as its Laplace transform at $\alpha$:

$$\hat\chi(\alpha) := \int_{(0, +\infty)} dt\, e^{-\alpha t}\, \chi(t),$$

where it is implicit that $\chi$ is the characteristic of the progenitor (born at time 0). Hereafter, we apply Theorems 1 and 5 of [18, Appendix A]. These theorems need some technical assumptions to hold, which we verify at the end of the proof of the next statement. These theorems ensure first that

$$\lim_{t \to \infty} e^{-\alpha t}\, \mathbb{E} A_\theta(k, t) = \frac{\mathbb{E} \hat\chi^k(\alpha)}{\beta},$$
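and second that, on the survival event,

$$\lim_{t \to \infty} \frac{A_\theta(k, t)}{A_\theta(t)} = \frac{\mathbb{E} \hat\chi^k(\alpha)}{\mathbb{E} \hat\chi(\alpha)} \quad \text{a.s.}$$

In addition to verifying the validity of the aforementioned technical assumptions, it remains to compute the quantities $\beta$, $\mathbb{E}\hat\chi(\alpha)$ and $\mathbb{E}\hat\chi^k(\alpha)$. With the following definitions,

$$U_k := \int_0^\infty dx\, \theta\, e^{-\theta x}\, \frac{1}{W_\theta(x)^2} \left(1 - \frac{1}{W_\theta(x)}\right)^{k-1}$$

and

$$U := \sum_{k \ge 1} U_k = \int_0^\infty dx\, \theta\, e^{-\theta x}\, \frac{1}{W_\theta(x)},$$

we have $\beta = \psi'(\alpha)/\alpha$, $\mathbb{E}\hat\chi^k(\alpha) = U_k/b$ and of course $\mathbb{E}\hat\chi(\alpha) = U/b$. This can be recorded in the following proposition.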

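Proposition 5.1 In the supercritical case,

$$\lim_{t \to \infty} e^{-\alpha t}\, \mathbb{E} A_\theta(k, t) = \frac{\alpha\, U_k}{b\, \psi'(\alpha)} \tag{5.1}$$

and

$$\lim_{t \to \infty} e^{-\alpha t}\, \mathbb{E} A_\theta(t) = \frac{\alpha\, U}{b\, \psi'(\alpha)}. \tag{5.2}$$

And on the survival event,

$$\lim_{t \to \infty} \frac{A_\theta(k, t)}{A_\theta(t)} = \frac{U_k}{U} \quad \text{a.s.}$$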



Remark 5.2 Note that it can be shown similarly that

$$\lim_{t \to \infty} e^{-\alpha t}\, \mathbb{E} N_t = \frac{\alpha}{b\, \psi'(\alpha)},$$

and that, for example,

$$\lim_{t \to \infty} \frac{A_\theta(t)}{N_t} = U \quad \text{a.s.}$$

This is reminiscent of Theorem 3.2 in [13], where the same limit is obtained after conditioning on the population size to equal $n$ and letting $n \to \infty$. This a.s. convergence is made possible by embedding all populations of fixed size on the same space thanks to an infinite coalescent point process: the population of size $n$ is that generated by the first $n$ values of the coalescent point process.

Remark 5.3 It has been proved that, in the supercritical case ($\alpha > 0$), the survival probability is $\alpha/b$ and that the scale function $W$ has the following asymptotic behaviour:

$$\lim_{t \to \infty} W(t)\, e^{-\alpha t} = \frac{1}{\psi'(\alpha)}.$$

One could have used these two facts and the monotone convergence theorem to recover (5.1) and (5.2) from Corollary 4.3. In the following proof, we prefer to show the agreement with Corollary 4.3 by computing directly $\beta$, $\mathbb{E}\hat\chi(\alpha)$ and $\mathbb{E}\hat\chi^k(\alpha)$.
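Proof. Let us first prove that $\beta = \psi'(\alpha)/\alpha$. Recalling the definition of $\beta$, we get

$$\begin{aligned}
\beta &= b\, \mathbb{E} \int_0^\infty du\, u\, e^{-\alpha u}\, \mathbf{1}_{\{u < V\}} \\
&= \int_{(0, +\infty]} \Lambda(dr) \int_0^r du\, u\, e^{-\alpha u} \\
&= \frac{1}{\alpha^2} \int_{(0, +\infty]} \Lambda(dr)\, \big(1 - e^{-\alpha r} - \alpha r\, e^{-\alpha r}\big) \\
&= \frac{1}{\alpha^2} \big(\alpha - \psi(\alpha)\big) - \frac{1}{\alpha} \big(1 - \psi'(\alpha)\big) \\
&= \frac{\psi'(\alpha)}{\alpha}.
\end{aligned}$$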
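The identity $\beta = \psi'(\alpha)/\alpha$ can be illustrated on a concrete case. The sketch below assumes exponential lifetimes with rate $d$ (a linear birth-death process), so that $\Lambda(dr) = b\,d\,e^{-dr}dr$, $\psi(x) = x - \int (1 - e^{-rx})\Lambda(dr) = x - bx/(x+d)$ and $\alpha = b - d$; these choices, like the function names, are assumptions made for the example. It computes $\beta$ by direct quadrature of $\int_0^\infty u\,e^{-\alpha u}\,b\,\mathbb{P}(V > u)\,du$ and compares it with a finite-difference evaluation of $\psi'(\alpha)/\alpha$.

```python
import math

def psi(x, b, d):
    """psi(x) = x - int (1 - e^{-r x}) Lambda(dr), assuming Lambda(dr) = b d e^{-d r} dr."""
    return x - b * x / (x + d)

def beta_quadrature(alpha, b, d, upper=60.0, n=60000):
    """beta = int_0^inf u e^{-alpha u} mu(du) with mu(du) = b P(V > u) du = b e^{-d u} du."""
    h = upper / n
    return h * sum(
        u * math.exp(-alpha * u) * b * math.exp(-d * u)
        for u in ((i + 0.5) * h for i in range(n))
    )

def beta_from_psi(alpha, b, d, eps=1e-6):
    """psi'(alpha) / alpha, with psi' approximated by a central difference."""
    dpsi = (psi(alpha + eps, b, d) - psi(alpha - eps, b, d)) / (2.0 * eps)
    return dpsi / alpha
```

With $b = 2$ and $d = 1$, the Malthusian parameter is $\alpha = 1$ and both routes give $\beta = 1/b = 0.5$.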

Next let us compute $\mathbb{E}\hat\chi^k(\alpha)$. Denote by $R_t^{(a,b)}$ the number of individuals alive at time $t$ descending clonally from the time interval $(a, b)$. More specifically, for a progenitor individual alive on the time interval $(a, b)$ and experiencing no mutation between times $a$ and $b$, $R_t^{(a,b)}$ is the number of individuals alive at $t$ (including possibly this progenitor) descending from those daughters of the progenitor who were born during the time interval $(a, b)$, and that still carry the same type that the progenitor carried at time $a$. In particular, since $W_\theta$ is the scale function associated with the clonal reproduction process,

$$\begin{aligned}
\mathbb{P}\big(R_t^{(a,b)} = k\big) &= \mathbb{P}\big(N^\theta_{t-a} = k \mid \zeta = b - a\big) \\
&= \mathbb{P}\big(N^\theta_{t-a} \ne 0 \mid \zeta = b - a\big)\, \mathbb{P}\big(N^\theta_{t-a} = k \mid N^\theta_{t-a} \ne 0\big) \\
&= \left(1 - \mathbf{1}_{t > b}\, \frac{W_\theta(t - b)}{W_\theta(t - a)}\right) \left(1 - \frac{1}{W_\theta(t - a)}\right)^{k-1} \frac{1}{W_\theta(t - a)},
\end{aligned} \tag{5.3}$$

where $N^\theta$ is the population size process of a clonal splitting tree and $\zeta$ is the lifetime of the progenitor.
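Now let us start with a progenitor with lifetime distributed as $V$ and denote by $\ell_i$ the time of the $i$-th point of a Poisson point process with intensity $\theta$ (the $i$-th mutation of the progenitor). Then

$$\begin{aligned}
\mathbb{E}\hat\chi^k(\alpha) &= \mathbb{E} \int_0^\infty dt\, e^{-\alpha t} \sum_{i \ge 1} \mathbf{1}_{\{\ell_i < V \wedge t\}}\, \mathbf{1}\big(R_t^{(\ell_i,\, \ell_{i+1} \wedge V)} = k\big) \\
&= \mathbb{E} \int_0^\infty dt\, e^{-\alpha t} \sum_{i \ge 1} \iint \mathbb{P}(\ell_i \in dy,\ \ell_{i+1} \in dz)\, \mathbf{1}_{\{y < V \wedge t\}}\, \mathbf{1}\big(R_t^{(y,\, z \wedge V)} = k\big) \\
&= \mathbb{E} \int_0^\infty dt\, e^{-\alpha t} \int_0^\infty dz\, \theta\, e^{-\theta z} \int_0^{z \wedge V \wedge t} dy\, \theta\, e^{\theta y}\, \mathbf{1}\big(R_t^{(y,\, z \wedge V)} = k\big) \\
&= \mathbb{E} \int_0^\infty dt\, e^{-\alpha t} \int_0^{V_\theta \wedge t} dy\, \theta\, e^{\theta y}\, \mathbf{1}\big(R_t^{(y,\, V_\theta)} = k\big),
\end{aligned}$$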
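where $V_\theta$ denotes the minimum of $V$ and of an independent exponential r.v. with parameter $\theta$. Then

$$\begin{aligned}
\mathbb{E}\hat\chi^k(\alpha) &= \int_0^\infty dt\, e^{-\alpha t} \int_{(0, \infty)} \mathbb{P}(V_\theta \in du) \int_0^t dy\, \mathbf{1}_{\{y < u\}}\, \theta\, e^{\theta y}\, \mathbb{P}\big(R_t^{(y, u)} = k\big) \\
&= \int_0^\infty dt\, e^{-\alpha t} \int_{(0, \infty)} \mathbb{P}(V_\theta \in du) \int_0^t dx\, \mathbf{1}_{\{t - x < u\}}\, \theta\, e^{\theta (t - x)}\, \mathbb{P}\big(R_t^{(t - x, u)} = k\big) \\
&= \int_0^\infty dx\, \theta\, e^{-\theta x} \int_{(0, \infty)} \mathbb{P}(V_\theta \in du) \int_x^{x + u} dt\, e^{(\theta - \alpha) t}\, \mathbb{P}\big(R_t^{(t - x, u)} = k\big),
\end{aligned}$$

which, thanks to (5.3), yields

$$\mathbb{E}\hat\chi^k(\alpha) = \int_0^\infty dx\, \theta\, e^{-\theta x}\, \frac{1}{W_\theta(x)} \left(1 - \frac{1}{W_\theta(x)}\right)^{k-1} \left(F_1(x) - \frac{F_2(x)}{W_\theta(x)}\right),$$

where

$$F_1(x) := \int_{(0, \infty)} \mathbb{P}(V_\theta \in du) \int_x^{x + u} dt\, e^{(\theta - \alpha) t}$$

and

$$F_2(x) := \int_{(0, \infty)} \mathbb{P}(V_\theta \in du) \int_x^{x + u} dt\, e^{(\theta - \alpha) t}\, \mathbf{1}_{t > u}\, W_\theta(t - u).$$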

■/(0,oo) 



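Let us compute $F_1$ and $F_2$. Set

$$\psi_\theta(x) := x - \int_{(0, \infty)} \big(1 - e^{-r x}\big)\, b\, \mathbb{P}(V_\theta \in dr), \qquad x \ge 0.$$

Then [13] $\psi_\theta(x) = x\, \psi(x + \theta)/(x + \theta)$, and $1/\psi_\theta$ is the Laplace transform of $W_\theta$. Also recall that $\psi(\alpha) = 0$, so that $\psi_\theta(\alpha - \theta) = 0$. First, if $\theta = \alpha$, then $F_1(x) = \int_{(0, \infty)} u\, \mathbb{P}(V_\theta \in du) = (1 - \psi_\theta'(0+))/b = 1/b$. Second, if $\theta \ne \alpha$, then

$$F_1(x) = \frac{e^{(\theta - \alpha) x}}{\alpha - \theta} \int_{(0, \infty)} \mathbb{P}(V_\theta \in du)\, \big(1 - e^{-(\alpha - \theta) u}\big) = \frac{e^{(\theta - \alpha) x}}{b (\alpha - \theta)}\, \big(\alpha - \theta - \psi_\theta(\alpha - \theta)\big),$$

so that whatever the respective values of $\alpha$ and $\theta$,

$$F_1(x) = \frac{1}{b}\, e^{(\theta - \alpha) x}.$$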
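We use Laplace transforms to compute $F_2$. For any $\kappa > 0$,

$$\begin{aligned}
\int_0^\infty dx\, \kappa\, e^{-\kappa x}\, F_2(x) &= \int_{(0, \infty)} \mathbb{P}(V_\theta \in du) \int_u^\infty dt\, e^{(\theta - \alpha) t}\, W_\theta(t - u) \int_{t - u}^t dx\, \kappa\, e^{-\kappa x} \\
&= \int_{(0, \infty)} \mathbb{P}(V_\theta \in du)\, \big(e^{\kappa u} - 1\big) \int_u^\infty dt\, e^{(\theta - \alpha - \kappa) t}\, W_\theta(t - u) \\
&= \int_{(0, \infty)} \mathbb{P}(V_\theta \in du)\, \big(e^{\kappa u} - 1\big)\, e^{(\theta - \alpha - \kappa) u} \int_0^\infty ds\, e^{(\theta - \alpha - \kappa) s}\, W_\theta(s) \\
&= \frac{1}{b\, \psi_\theta(\kappa + \alpha - \theta)} \Big(\kappa + \alpha - \theta - \psi_\theta(\kappa + \alpha - \theta) - \big(\alpha - \theta - \psi_\theta(\alpha - \theta)\big)\Big) \\
&= \frac{\kappa}{b\, \psi_\theta(\kappa + \alpha - \theta)} - \frac{1}{b},
\end{aligned}$$

so that

$$F_2(x) = \frac{1}{b}\, e^{(\theta - \alpha) x}\, W_\theta(x) - \frac{1}{b}$$

and

$$F_1(x) - \frac{F_2(x)}{W_\theta(x)} = \frac{1}{b\, W_\theta(x)}.$$

As a consequence, we get

$$\mathbb{E}\hat\chi^k(\alpha) = \frac{1}{b} \int_0^\infty dx\, \frac{\theta\, e^{-\theta x}}{W_\theta(x)^2} \left(1 - \frac{1}{W_\theta(x)}\right)^{k-1},$$

which is the announced $U_k/b$.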

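Last, let us check the technical assumptions required for Theorems 1 and 5 in [18, Appendix A] to hold. For the first theorem, we have to check the following two requirements:

$$\sum_{n \ge 0}\ \sup_{u \in [n, n+1]} e^{-\alpha u}\, \mathbb{E}\chi(u) < \infty \tag{5.4}$$

$$t \mapsto \mathbb{E}\chi(t) \ \text{is a.e. continuous.} \tag{5.5}$$

For the second theorem, we have to check the following two requirements:

$$\exists\, \eta < \alpha, \quad \mathbb{E} \sup_{t \ge 0} e^{-\eta t}\, \chi(t) < \infty \tag{5.6}$$

$$\exists\, \eta < \alpha, \quad \hat\mu(\eta) < \infty. \tag{5.7}$$

The following equality in distribution is easily seen:

$$\chi(t) \overset{d}{=} \sum_{i \ge 1} \mathbf{1}_{\{T_i < t \wedge V\}}\, \mathbf{1}_{\left\{\sum_{j \ge 1} N_j(t - S_j)\, \mathbf{1}_{\{T_i < S_j < T_{i+1} \wedge t \wedge V\}}\, \in\, A\right\}},$$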



where $V$ is distributed as a lifetime, the $(T_i)$ are the ranked atoms of an independent Poisson point process with rate $\theta$ (mutation times), the $(S_j)$ are the ranked atoms of an independent Poisson point process with rate $b$ (birth times), the $(N_j)$ form an independent sequence of i.i.d. homogeneous, binary CMJ processes (descendances of daughters), and $A$ is taken equal to $\mathbb{N}$, but can be taken equal to $\{k\}$ in the case of the random characteristic $\chi^k$. In any case, $\chi$ is dominated by a Poisson point process with rate $\theta$, so that $\mathbb{E}\chi(t) \le \theta t$. This ensures that (5.4) holds. As for (5.5), notice from the last displayed equation that $\mathbb{E}\chi(t) = \sum_{i \ge 1} F_i(t)$, where

$$F_i(t) := \int_0^t \int_u^\infty \mathbb{P}(T_i \in du,\ T_{i+1} \in ds) \int_{[u, \infty)} \mathbb{P}(V \in dr)\, \mathbb{P}\Big(\sum_{j \ge 1} N_j(t - S_j)\, \mathbf{1}_{\{u < S_j < s \wedge t \wedge r\}} \in A\Big).$$

Because $T_i$ has a density w.r.t. Lebesgue measure, each $F_i$ is everywhere continuous on, say, $[0, t_0]$. In addition, for any $t \in [0, t_0]$, $F_i(t) \le \mathbb{P}(T_i \le t) \le \mathbb{P}(T_i \le t_0)$ and $\sum_{i \ge 1} \mathbb{P}(T_i \le t_0) = \theta t_0 < \infty$, so we get continuity of $t \mapsto \mathbb{E}\chi(t)$ on $[0, t_0]$ by dominated convergence. Because $t_0$ is arbitrary, $t \mapsto \mathbb{E}\chi(t)$ is continuous everywhere.

Let us treat the last two requirements. The last requirement (5.7) merely stems from the obvious inequality $\mu(t) \le b t$. To prove (5.6), because $\chi$ is dominated by a Poisson point process, it suffices to show that for any Poisson point process $Y$ with rate 1, say, and for any $\eta > 0$, $\mathbb{E} \sup_{t \ge 0} e^{-\eta t}\, Y_t < \infty$. In fact, setting $M_c(t) := e^{-\eta t}(Y_t + c)$, we claim that for large enough $c$, $M_c^2$ is a supermartingale. Then using the inequality $\mathbb{P}(\sup_t M_c^2(t) \ge z) \le c^2/z$, we get

$$\mathbb{P}\Big(\sup_t Y_t\, e^{-\eta t} > y\Big) \le \mathbb{P}\Big(\sup_t (Y_t + c)\, e^{-\eta t} > y\Big) = \mathbb{P}\Big(\sup_t M_c^2(t) > y^2\Big) \le \frac{c^2}{y^2},$$

so that $\mathbb{E}(\sup_t Y_t\, e^{-\eta t}) < \infty$. The only thing left to show is that $M_c^2$ is a supermartingale. Writing $(\mathcal{F}_t)$ for the natural filtration of $Y$ and $P_s$ for a Poisson random variable with parameter $s$ independent of $Y_t$, we get

$$\mathbb{E}\big(M_c(t + s)^2 \mid \mathcal{F}_t\big) = e^{-2\eta(t + s)}\, \mathbb{E}\big((Y_t + c + P_s)^2 \mid \mathcal{F}_t\big) = e^{-2\eta(t + s)} \big((Y_t + c + s)^2 + s\big) \le M_c(t)^2,$$

where the last inequality holds for any $s, t \ge 0$ if there is some positive $c$ (depending only on $\eta$) such that

$$e^{-2\eta s} \big((x + s)^2 + s\big) \le x^2, \qquad x \ge c,\ s \ge 0.$$

Then we study the function $f : s \mapsto x^2 e^{2\eta s} - (x + s)^2 - s$. Since $f''(s) = 4\eta^2 x^2 e^{2\eta s} - 2$, $f'$ is nondecreasing on $[0, +\infty)$ as soon as $x^2 \ge 1/(2\eta^2)$. On the other hand, $f'(0) = 2\eta x^2 - 1 - 2x$. Let $x^\star$ be the largest root of $x \mapsto 2\eta x^2 - 1 - 2x$. As soon as $x \ge x^\star$, $f'(0) \ge 0$. Setting $c := \max(1/(\eta\sqrt{2}),\ x^\star)$, as soon as $x \ge c$, $f'(0) \ge 0$ and $f'$ is nondecreasing on $[0, \infty)$, so that $f$ is nondecreasing on $[0, \infty)$. Since $f(0) = 0$, we conclude that $f$ is non-negative on $[0, \infty)$, so that $M_c^2$ indeed is a supermartingale.
□ 
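The elementary inequality behind the supermartingale property, $e^{-2\eta s}((x+s)^2 + s) \le x^2$ for $x \ge c$ and $s \ge 0$, is easy to probe numerically. The sketch below computes the threshold $c = \max(1/(\eta\sqrt{2}), x^\star)$, where $x^\star = (1 + \sqrt{1 + 2\eta})/(2\eta)$ is the largest root of $2\eta x^2 - 2x - 1$, and evaluates $f(s) = x^2 e^{2\eta s} - (x+s)^2 - s$ on a grid. The function names and the grid bounds are illustrative choices, and a grid check is of course no substitute for the convexity argument above.

```python
import math

def threshold_c(eta):
    """c = max(1/(eta*sqrt(2)), x_star), with x_star the largest root of 2*eta*x^2 - 2*x - 1."""
    x_star = (1.0 + math.sqrt(1.0 + 2.0 * eta)) / (2.0 * eta)
    return max(1.0 / (eta * math.sqrt(2.0)), x_star)

def f(s, x, eta):
    """f(s) = x^2 e^{2 eta s} - (x + s)^2 - s; non-negativity is the supermartingale condition."""
    return x * x * math.exp(2.0 * eta * s) - (x + s) ** 2 - s

def min_f_on_grid(eta, x, s_max=50.0, n=5000):
    """Minimum of f over a regular grid of [0, s_max]."""
    return min(f(i * s_max / n, x, eta) for i in range(n + 1))
```

For instance, with $\eta = 0.5$ the threshold is $c = 1 + \sqrt{2}$, and taking $x = 1 < c$ makes $f$ negative near $s = 0$, which shows that some lower bound on $x$ is needed.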

6 Expected homozygosities through moment generating functions 

We consider again the coalescent point process of Section 3 constructed from $H_0 = +\infty$ and the i.i.d. sequence of r.v. $(H_i)_{i \ge 1}$, with common law $\mathbb{P}(H \in \cdot)$. Let us recall that, in the case of splitting trees, the law of $H$ has a density w.r.t. Lebesgue measure. We introduce the derivative of $\log W(t)$:

$$p(t)\, dt = \mathbb{P}(H < t + dt \mid H > t) = W(t)\, \mathbb{P}(H \in dt). \tag{6.1}$$

For any time $t$, we consider the splitting tree obtained from $H_0, \dots, H_{N_t - 1}$, where $N_t := \inf\{i \ge 1 : H_i > t\}$. We then define the (standard) homozygosity $G_\theta(t)$ as the probability that two distinct randomly sampled individuals in the population at time $t$ share the same haplotype, and the absolute homozygosity $\bar G_\theta(t)$ as the number of pairs of distinct individuals in the population at time $t$ that share the same haplotype. Note that both of these quantities are 0 on the event $\{N_t = 1\}$, and on the complement event,

$$G_\theta(t) = \frac{2\, \bar G_\theta(t)}{N_t (N_t - 1)}. \tag{6.2}$$

The notation $\bar G_\theta(t)$ coincides with that of Subsection 4.1. We also recall that $Z_0(t)$ denotes the number of individuals sharing the ancestral haplotype, defined here as the haplotype of individual 0 at time $-t$.

Our goal in this section is to compute $\mathbb{E}^\star(G_\theta(t))$ and $\mathbb{E}^\star(\bar G_\theta(t))$ using another method than in Section 4. As in [13], we characterize the joint law of $(\bar G_\theta(t), N_t, Z_0(t))$ as time increases in a similar fashion as for branching processes, in order to obtain backward Kolmogorov equations for moment generating functions involving these random variables. The result will then follow by solving these equations.

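Proposition 6.1 For all $t > 0$, the expected absolute homozygosity is given by

$$\mathbb{E}^\star\big(\bar G_\theta(t)\big) = W(t)\, \big(W_{2\theta}(t) - 1\big),$$

whereas the expected standard homozygosity is given by

$$\mathbb{E}^\star\big(G_\theta(t)\big) = e^{-2\theta t}\, \frac{W(t) - 1}{W(t)} + 4\theta\, W(t) \int_0^t e^{-2\theta s}\, \frac{W(s) - 1}{W(t) - W(s)} \left[\frac{\log W(t) - \log W(s)}{W(t) - W(s)} - \frac{1}{W(t)}\right] ds.$$

6.1 Joint dynamics of $\bar G_\theta(t)$, $N_t$ and $Z_0(t)$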

Consider two splitting trees of age $t$, with respective absolute homozygosity, population size, number of ancestral individuals and height processes $\bar G_\theta(t)$, $N_t$, $Z_0(t)$, $(H_i)_{i \ge 0}$ and $\bar G'_\theta(t)$, $N'_t$, $Z'_0(t)$, $(H'_i)_{i \ge 0}$. We call merger of these two splitting trees the splitting tree obtained from the sequence of heights $H_0 = +\infty, H_1, \dots, H_{N_t - 1}, \tilde H_0, H'_1, \dots, H'_{N'_t - 1}$, where $\tilde H_0$ is obtained from the infinite branch $H'_0$ by cutting the part below $-t$. In addition, all the mutation times are kept unchanged on each branch of the tree.

After this merger event, the new splitting tree has population size $N_t + N'_t$, the new number of ancestral individuals is $Z_0(t) + Z'_0(t)$ and the new absolute homozygosity is, counting first the pairs of ancestral individuals,

$$\frac{(Z_0(t) + Z'_0(t))(Z_0(t) + Z'_0(t) - 1)}{2} + \bar G_\theta(t) - \frac{Z_0(t)(Z_0(t) - 1)}{2} + \bar G'_\theta(t) - \frac{Z'_0(t)(Z'_0(t) - 1)}{2} = \bar G_\theta(t) + \bar G'_\theta(t) + Z_0(t)\, Z'_0(t).$$



Now, we have $(\bar G_\theta(0), N_0, Z_0(0)) = (0, 1, 1)$ and, if the law of $(\bar G_\theta(t), N_t, Z_0(t))$ is known for some $t \ge 0$, then, on the time interval $[t, t + dt]$,

• either a mutation occurs on the ancestral branch, with probability $\theta\, dt$, and

$$(\bar G_\theta(t + dt), N_{t + dt}, Z_0(t + dt)) = (\bar G_\theta(t), N_t, 0),$$

• or $H_{N_t} \in [t, t + dt]$, with probability $p(t)\, dt$ defined in (6.1), and

$$(\bar G_\theta(t + dt), N_{t + dt}, Z_0(t + dt)) = (\bar G_\theta(t) + \bar G'_\theta(t) + Z_0(t) Z'_0(t),\ N_t + N'_t,\ Z_0(t) + Z'_0(t)),$$

where $(\bar G'_\theta(t), N'_t, Z'_0(t))$ is an i.i.d. copy of $(\bar G_\theta(t), N_t, Z_0(t))$,

• or nothing happens (the probability that two or more of the previous events occur is $o(dt)$).

In other words, when the ancestral time $t$ increases, the process $(\bar G_\theta(t), N_t, Z_0(t))$ jumps to $(\bar G_\theta(t), N_t, 0)$ with rate $\theta$ and to $(\bar G_\theta(t) + \bar G'_\theta(t) + Z_0(t) Z'_0(t),\ N_t + N'_t,\ Z_0(t) + Z'_0(t))$ with instantaneous rate $p(t)$.

Of course, the previous argument is quite informal, but it could easily be made rigorous by considering all the possible events that could occur in the time interval $[t, t + s]$, and letting $s \to 0$. In particular, the Kolmogorov equations of the following subsection can easily be justified this way.



6.2 Moment generating functions computations 

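We define the moment generating functions

$$L(t, u) = \mathbb{E}^\star\big(\bar G_\theta(t)\, u^{N_t - 2}\big) \tag{6.3}$$

$$M(t, u, v) = \mathbb{E}^\star\big(u^{N_t - 1}\, v^{Z_0(t)}\big), \tag{6.4}$$

for all $u, v \in [-1, 1]$ and $t \ge 0$. Since $\bar G_\theta(t) = 0$ if $N_t \le 1$ and the quantities inside the expectations are bounded by $N_t^2$, these functions have finite values. Our goal here is to compute explicit expressions for these quantities.

Note that, for any i.i.d. triples of nonnegative r.v. $(\bar G_\theta, N, Z_0)$ and $(\bar G'_\theta, N', Z'_0)$,

$$\mathbb{E}\big((\bar G_\theta + \bar G'_\theta + Z_0 Z'_0)\, u^{N + N' - 2}\big) = 2\, \mathbb{E}\big(\bar G_\theta\, u^{N - 2}\big)\, \mathbb{E}\big(u^N\big) + \big(\mathbb{E}(Z_0\, u^{N - 1})\big)^2.$$

Using this equation and the previous construction of the process, we can write the forward Kolmogorov equation for the moment generating functions $L$ and $M$: for all $u, v \in [-1, 1]$ and $t \ge 0$,

$$\begin{cases}
\partial_t L(t, u) = -(\theta + p(t))\, L(t, u) + \theta\, L(t, u) + p(t) \Big[2u\, L(t, u)\, M(t, u, 1) + \big(\partial_v M(t, u, 1)\big)^2\Big] \\
L(0, u) = 0
\end{cases} \tag{6.5}$$

and

$$\begin{cases}
\partial_t M(t, u, v) = -(\theta + p(t))\, M(t, u, v) + \theta\, M(t, u, 1) + p(t)\, u\, \big(M(t, u, v)\big)^2 \\
M(0, u, v) = v.
\end{cases} \tag{6.6}$$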
The explicit computation of the solutions of these equations requires several steps. First, for fixed $u$ and $v$, the function $M(t, u, v)$ is solution to an ODE known as Riccati's equation. In the case where $v = 1$, the function $f(t) = M(t, u, 1)$ is solution to

$$f' = p\, f\, (u f - 1),$$

which is known as Bernoulli's equation. It can be solved by making the change of unknown function $\tilde f = 1/f$, which makes the ODE linear. This yields

$$f(t) = M(t, u, 1) = \left(u + (1 - u) \exp\Big(\int_0^t p(s)\, ds\Big)\right)^{-1} = \frac{W(t; u)}{W(t)}, \tag{6.7}$$

where we used that $p$ is the derivative of the function $\log W(t)$.
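To illustrate this first step, the sketch below integrates the Bernoulli equation $f' = p f (u f - 1)$ with $f(0) = 1$ by a classical fourth-order Runge-Kutta scheme and compares the result with the closed form $1/(u + (1-u)\exp(\int_0^t p))$. It assumes a constant rate $p(t) = b$, which corresponds to the illustrative choice $W(t) = e^{bt}$; the solver and the function names are written for this example only.

```python
import math

def rk4(rhs, y0, t_end, steps=1000):
    """Classical Runge-Kutta scheme of order 4 for y' = rhs(t, y), y(0) = y0."""
    h = t_end / steps
    t, y = 0.0, y0
    for _ in range(steps):
        k1 = rhs(t, y)
        k2 = rhs(t + h / 2.0, y + h * k1 / 2.0)
        k3 = rhs(t + h / 2.0, y + h * k2 / 2.0)
        k4 = rhs(t + h, y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += h
    return y

def f_closed(t, u, b=1.0):
    """1 / (u + (1 - u) exp(int_0^t p)), with p = b constant (assumption W(t) = exp(b t))."""
    return 1.0 / (u + (1.0 - u) * math.exp(b * t))

def f_numeric(t, u, b=1.0):
    """Numerical solution of f' = p f (u f - 1), f(0) = M(0, u, 1) = 1."""
    return rk4(lambda s, f: b * f * (u * f - 1.0), 1.0, t)
```

For $u = 1$ both functions return 1, as expected from $M(t, 1, 1) = 1$.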

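Second, for all $u, v \in [-1, 1]$, the function $M(t, u, 1)$ is a particular solution of (6.6) (with different initial condition). Hence, the function $g(t) = M(t, u, v) - M(t, u, 1) = M(t, u, v) - f(t)$ solves the Bernoulli ODE

$$g' = -(\theta + p - 2 u p f)\, g + u p\, g^2,$$

for which the previous trick again works. This yields

$$M(t, u, v) = f(t) + \frac{\exp\Big(-\int_0^t \big(\theta + p(s) - 2 u\, p(s) f(s)\big)\, ds\Big)}{(v - 1)^{-1} - u \int_0^t p(s) \exp\Big(-\int_0^s \big(\theta + p(\tau) - 2 u\, p(\tau) f(\tau)\big)\, d\tau\Big)\, ds}.$$

Since $u\, W(s; u)\, \mathbb{P}(H \in ds)$ is the derivative of $\log W(\cdot\,; u)$, it follows from (6.7) that

$$\int_0^t p(s)\, \big(1 - 2 u f(s)\big)\, ds = \log W(t) - 2 \log W(t; u). \tag{6.8}$$

Hence, we obtain

$$M(t, u, v) = \frac{W(t; u)}{W(t)} \left(1 + \frac{e^{-\theta t}\, W(t; u)}{(v - 1)^{-1} - u \int_0^t e^{-\theta s}\, W(s; u)^2\, \mathbb{P}(H \in ds)}\right).$$

Observing that $u\, W(s; u)^2\, \mathbb{P}(H \in ds)$ is the derivative of $W(\cdot\,; u)$, an integration by parts and Theorem 3.1 finally yield

$$M(t, u, v) = \frac{W(t; u)}{W(t)} \left(1 + \frac{e^{-\theta t}\, W(t; u)}{(v - 1)^{-1} + 1 - W_\theta(t; u)}\right).$$

We then compute

$$M(t, u, 1) = \frac{W(t; u)}{W(t)} = f(t) \quad \text{and} \quad \partial_v M(t, u, 1) = e^{-\theta t}\, \frac{W(t; u)^2}{W(t)} =: q(t).$$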

Third, the linear equation (6.5) can be explicitly solved:

$$L(t, u) = \exp\Big(-\int_0^t p(s)\, \big(1 - 2 u f(s)\big)\, ds\Big) \int_0^t p(s)\, q^2(s)\, \exp\Big(\int_0^s p(\tau)\, \big(1 - 2 u f(\tau)\big)\, d\tau\Big)\, ds.$$

Using (6.8) again, we obtain

$$L(t, u) = \frac{W(t; u)^2}{W(t)} \int_0^t e^{-2\theta s}\, W(s; u)^2\, \mathbb{P}(H \in ds).$$

Using integration by parts as above finally yields

$$L(t, u) = \frac{W(t; u)^2}{W(t)}\, \frac{W_{2\theta}(t; u) - 1}{u}, \tag{6.9}$$

which is consistent with Proposition 4.5.
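Fourth, using Theorem 3.1, we have

$$\frac{W_{2\theta}(t; u) - 1}{u} = e^{-2\theta t}\, \frac{W(t; u) - 1}{u} + 2\theta \int_0^t e^{-2\theta s}\, \frac{W(s; u) - 1}{u}\, ds.$$

This yields

$$L(t, u) = \frac{W(t; u)^2}{W(t)} \left(e^{-2\theta t}\, \mathbb{P}(H < t)\, W(t; u) + 2\theta \int_0^t e^{-2\theta s}\, \mathbb{P}(H < s)\, W(s; u)\, ds\right).$$

Writing the product series of $(1 - v)^{-1} = \sum_{n \ge 0} v^n$ and $(1 - v)^{-2} = \sum_{n \ge 0} (n + 1)\, v^n$ and observing that

$$\sum_{k = 0}^n (k + 1)\, a^k\, b^{n - k} = \frac{\partial}{\partial a} \Big(\sum_{k = 0}^n a^{k + 1}\, b^{n - k}\Big) = \frac{(n + 1)\, a^{n + 2} - (n + 2)\, a^{n + 1}\, b + b^{n + 2}}{(a - b)^2},$$

we get

$$\begin{aligned}
L(t, u) = {} & \frac{e^{-2\theta t}\, \mathbb{P}(H < t)}{2\, W(t)} \sum_{n \ge 2} n (n - 1)\, \big(\mathbb{P}(H < t)\, u\big)^{n - 2} + \frac{2\theta}{W(t)} \int_0^t ds\, e^{-2\theta s}\, \mathbb{P}(H < s) \\
& \times \sum_{n \ge 0} u^n\, \frac{(n + 1)\, \mathbb{P}(H < t)^{n + 2} - (n + 2)\, \mathbb{P}(H < t)^{n + 1}\, \mathbb{P}(H < s) + \mathbb{P}(H < s)^{n + 2}}{\mathbb{P}(s < H < t)^2}.
\end{aligned} \tag{6.10}$$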

Finally, we compute the expected standard homozygosity as follows: by (6.2),

$$\partial_u^2\, \mathbb{E}^\star\big(G_\theta(t)\, u^{N_t}\big) = 2\, L(t, u), \quad \text{or} \quad \mathbb{E}^\star\big(G_\theta(t)\big) = 2 \int_0^1 du \int_0^u dv\, L(t, v).$$

Integrating (6.10) twice and using the equation

$$(1 - x) \log(1 - x) + x = \sum_{n \ge 2} \frac{x^n}{n (n - 1)}$$

yields

$$\mathbb{E}^\star\big[G_\theta(t)\big] = \frac{e^{-2\theta t}\, (W(t) - 1)}{W(t)} + 4\theta\, W(t) \int_0^t ds\, e^{-2\theta s}\, \frac{W(s) - 1}{W(t) - W(s)} \left[\frac{1}{W(t) - W(s)} \log \frac{W(t)}{W(s)} - \frac{1}{W(t)}\right],$$

which ends the proof of Proposition 6.1.
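The last integration step can be cross-checked numerically. The sketch below computes the expected standard homozygosity in two ways: directly as $2\int_0^1 (1 - v)\, L(t, v)\, dv$ (the double integral of $L$ after one application of Fubini), with $L(t,v) = W(t;v)^2 W(t)^{-1}\int_0^t e^{-2\theta s} W(s;v)^2\,\mathbb{P}(H \in ds)$, and through the closed form $e^{-2\theta t}(W(t)-1)/W(t) + 4\theta\int_0^t e^{-2\theta s}\frac{W(s)-1}{W(t)-W(s)}\big[\frac{W(t)}{W(t)-W(s)}\log\frac{W(t)}{W(s)} - 1\big]ds$. The scale function $W(x) = e^{bx}$, for which $\mathbb{P}(H \in ds) = b e^{-bs} ds$, and all names are assumptions made for this example.

```python
import math

def W(x, b=1.0):
    """Scale function; W(x) = exp(b*x) is an illustrative assumption."""
    return math.exp(b * x)

def W_u(x, u, b=1.0):
    """W(x; u) = 1 / (1 - u P(H < x)), with P(H < x) = 1 - 1/W(x)."""
    return 1.0 / (1.0 - u * (1.0 - 1.0 / W(x, b)))

def midpoint(f, lo, hi, n):
    """Composite midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def L(t, u, theta, b=1.0, n=200):
    """L(t,u) = W(t;u)^2 / W(t) * int_0^t e^{-2 theta s} W(s;u)^2 P(H in ds)."""
    def integrand(s):
        density = b * math.exp(-b * s)  # density of H when W = exp(b*.)
        return math.exp(-2.0 * theta * s) * W_u(s, u, b) ** 2 * density
    return W_u(t, u, b) ** 2 / W(t, b) * midpoint(integrand, 0.0, t, n)

def G_direct(t, theta, b=1.0, n=200):
    """2 * int_0^1 du int_0^u dv L(t, v) = 2 * int_0^1 (1 - v) L(t, v) dv."""
    return 2.0 * midpoint(lambda v: (1.0 - v) * L(t, v, theta, b, n), 0.0, 1.0, n)

def G_closed(t, theta, b=1.0, n=2000):
    """Closed-form expression for the expected standard homozygosity."""
    wt = W(t, b)
    def integrand(s):
        ws = W(s, b)
        gap = wt - ws
        return (math.exp(-2.0 * theta * s) * (ws - 1.0) / gap
                * (wt * math.log(wt / ws) / gap - 1.0))
    return (math.exp(-2.0 * theta * t) * (wt - 1.0) / wt
            + 4.0 * theta * midpoint(integrand, 0.0, t, n))
```

When $\theta = 0$ no mutation occurs, so two sampled individuals always share a haplotype and both expressions reduce to $\mathbb{P}^\star(N_t \ge 2) = \mathbb{P}(H < t)$.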